Data Warehousing Concepts & Architectures

Contents of this slideshow:
• What is a data warehouse?
• Multi-dimensional data modelling
• Data warehouse architecture
The hidden slides of this slideshow may be important.
However, I will focus on learning by exercises; therefore, new concepts
are often just rattled off in hidden slides.
OLTP versus OLAP
OLTP = On Line Transaction Processing
OLAP = On Line Analytical Processing
                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker/decision makers
function            day to day operations         decision support
DB design           application-oriented          subject-oriented (business functions)
data                current, up-to-date;          historical;
                    detailed, flat relational;    summarized, multidimensional;
                    isolated data                 integrated, consolidated data
usage               repetitive                    ad-hoc
access              read/write;                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
An example of a data warehouse:
A star schema data warehouse has a central table (the fact table)
surrounded by dimension tables with one-to-many relationships towards
the fact table.
Fact table Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Dimension Orders: Order#, Ordertype
Dimension Products: Product#, Product-name, Price
Dimension Salesmen: Salesman#, Salesman-name
Dimension Time: Date#, Date-Name
The fixed database structure implies that application programs
(drilling functions/aggregates) can be generated automatically!
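The star schema above can be sketched as runnable DDL. This is only an illustration using Python's built-in sqlite3 module: the column names are rewritten (for example Product# becomes product_no) because `#` is not portable in SQL identifiers, and all rows are made up.

```python
import sqlite3

# Assumption: snake_case column names replace the slide's "Product#" style,
# and the sample rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE salesmen (salesman_no INTEGER PRIMARY KEY, salesman_name TEXT);
CREATE TABLE orders   (order_no INTEGER PRIMARY KEY, ordertype TEXT);
CREATE TABLE time_dim (date_no INTEGER PRIMARY KEY, date_name TEXT);
-- Fact table: one foreign key per dimension plus the measure qty.
CREATE TABLE orderdetails (
    product_no  INTEGER REFERENCES products(product_no),
    order_no    INTEGER REFERENCES orders(order_no),
    date_no     INTEGER REFERENCES time_dim(date_no),
    salesman_no INTEGER REFERENCES salesmen(salesman_no),
    qty         INTEGER
);
INSERT INTO products VALUES (1, 'Screw', 2.0), (2, 'Bolt', 5.0);
INSERT INTO salesmen VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO orders   VALUES (10, 'web'), (11, 'phone');
INSERT INTO time_dim VALUES (20240101, '1 Jan 2024');
INSERT INTO orderdetails VALUES
    (1, 10, 20240101, 1, 3), (2, 10, 20240101, 1, 1), (1, 11, 20240101, 2, 4);
""")

fact_rows = conn.execute("SELECT COUNT(*) FROM orderdetails").fetchone()[0]

# Every fact row matches exactly one row in each dimension (one-to-many
# from dimension to fact), so joining all dimensions keeps the row count.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM orderdetails f
    JOIN products p ON p.product_no  = f.product_no
    JOIN orders   o ON o.order_no    = f.order_no
    JOIN time_dim t ON t.date_no     = f.date_no
    JOIN salesmen s ON s.salesman_no = f.salesman_no
""").fetchone()[0]

print(fact_rows, joined_rows)
```

Because the join never multiplies or drops fact rows, any measure can be summed after joining to any combination of dimensions, which is what makes generated drilling functions feasible.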
Dimension hierarchies:
A dimension hierarchy is a set of tables connected by one-to-many
relationships towards the fact table:
Fact table
Orderdetails
- Product#
- Order#
- Qty
- Price
Dimension hierarchy
Orders
- Order#
- Customer#
- Date
Customers
- Customer#
- Customer-name
In a dimension hierarchy it is possible to aggregate data from the fact
table to the different levels of the hierarchy.
Roll-up = aggregate along one or more dimensions.
Drill-down = “de-aggregate” = break an aggregate into its constituents.
Two different types of drilling:
-Drilling in dimension hierarchies
-Drilling between dimensions.
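A minimal sketch of drilling in a dimension hierarchy, assuming the Orderdetails -> Orders -> Customers hierarchy shown earlier. It uses Python's sqlite3 module; the renamed columns (order_no, customer_no) and the sample rows are assumptions made for the example.

```python
import sqlite3

# Assumption: invented sample data and snake_case names instead of "Order#".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_no INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (order_no INTEGER PRIMARY KEY,
                     customer_no INTEGER REFERENCES customers(customer_no),
                     order_date TEXT);
CREATE TABLE orderdetails (product_no INTEGER,
                           order_no INTEGER REFERENCES orders(order_no),
                           qty INTEGER, price REAL);
INSERT INTO customers VALUES (1, 'Acme'), (2, 'Bravo');
INSERT INTO orders VALUES (10, 1, '2024-01-05'), (11, 1, '2024-01-09'),
                          (12, 2, '2024-01-09');
INSERT INTO orderdetails VALUES (1, 10, 2, 10.0), (2, 10, 1, 5.0),
                                (1, 11, 3, 10.0), (2, 12, 4, 5.0);
""")

# Lower level of the hierarchy: turnover per order.
per_order = conn.execute("""
    SELECT order_no, SUM(qty * price)
    FROM orderdetails
    GROUP BY order_no
    ORDER BY order_no
""").fetchall()

# Roll-up to the next level: turnover per customer.
per_customer = conn.execute("""
    SELECT c.customer_name, SUM(f.qty * f.price)
    FROM orderdetails f
    JOIN orders o    ON o.order_no = f.order_no
    JOIN customers c ON c.customer_no = o.customer_no
    GROUP BY c.customer_name
    ORDER BY c.customer_name
""").fetchall()

print(per_order)
print(per_customer)
```

Going from `per_customer` back to `per_order` is the corresponding drill-down.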
Which star schemas or data marts can be built by using the illustrated
integrated E-commerce/ERP data model?
Which star schema would you recommend to be implemented first?
Entities of the E-commerce/ERP data model:
Product: Product#, ProductName, Price
Order-Detail: Product#, Order#, Qty, Price, Timestamp
Order-DetailHistory: Inv-Item#, Order#, Seq#, State, Timestamp
InvoiceHistory: Invoice#, Timestamp, State, Notes
Product-Stock: Product#, Location#, Qty
Order: Order#, OrderDate, Balance, State
Shipping: Shipping#, ShipMethod, ShipCharge, State, ShipDate
Address: Address#, Name, Add1, Add2, City, State, Zip
Invoice: Invoice#, CreationDate
Location: Location#, Address
Customer: Customer#, Credit-Limit, Balance
UserSession: Session#, IPaddress, #Click, Timestamp
UserAccount: Salesman#, PassWord, Timestamp, #visits, #trans, Ttl-tr-amount
Payment: Payment#, Amount, State, Timestamp
CreditCard: Card#, HolderName, ExpireDate
Other labels in the figure: Shipping, Billing
A galaxy is a set of star fact tables with conformed (shared and
mutually adapted) dimensions:
Time dimension hierarchy:
Year: yy
Month: yy, mm
Day: Date#, yy, mm, dd
Fact tables (the value chain):
SaleOrderdetails: Product#, Sale-order#, Qty, Discount, Sale-price, Date#
Storage-per-product: Product#, Date#, End-of-day-storage-qty
Purchaseorderdetails: Product#, Purchase-order#, Purchase-price, Qty, Date#
Product dimension hierarchy:
Products: Product#, Product-name
Product groups: Product-group#, Product-group-name
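A sketch of how conformed dimensions let two fact tables of a galaxy be combined ("drill across"), using Python's sqlite3 module. The table and column names are simplified versions of the ones in the figure, and the rows are invented. Each fact table is aggregated to the shared Products level first and only then joined, which is an assumption about good practice rather than something the slides prescribe.

```python
import sqlite3

# Assumption: simplified names and invented rows for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sale_orderdetails
    (product_no INTEGER, date_no INTEGER, qty INTEGER, sale_price REAL);
CREATE TABLE purchase_orderdetails
    (product_no INTEGER, date_no INTEGER, qty INTEGER, purchase_price REAL);
INSERT INTO products VALUES (1, 'Screw'), (2, 'Bolt');
INSERT INTO sale_orderdetails VALUES
    (1, 20240101, 5, 3.0), (1, 20240102, 5, 3.0), (2, 20240101, 2, 8.0);
INSERT INTO purchase_orderdetails VALUES
    (1, 20240101, 20, 1.0), (2, 20240101, 10, 4.0);
""")

# Aggregate each star separately, then join the results on the conformed
# Products dimension. Joining the raw fact tables directly would repeat
# purchase rows once per matching sale row and inflate the sums.
drill_across = conn.execute("""
    SELECT p.product_name, s.sold_qty, b.bought_qty
    FROM products p
    JOIN (SELECT product_no, SUM(qty) AS sold_qty
          FROM sale_orderdetails GROUP BY product_no) s
      ON s.product_no = p.product_no
    JOIN (SELECT product_no, SUM(qty) AS bought_qty
          FROM purchase_orderdetails GROUP BY product_no) b
      ON b.product_no = p.product_no
    ORDER BY p.product_name
""").fetchall()

print(drill_across)
```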
Conceptual Modeling of Data Warehouses
– Star schema: A fact table in the middle connected to a set of
dimension tables
– Snowflake schema: A refinement of the star schema where
some dimensional hierarchies are normalized into a set of
smaller dimension tables, forming a shape similar to a
snowflake
– Galaxy schema: Multiple fact tables share dimension tables
(Conformed dimensions), viewed as a collection of stars,
therefore called galaxy schema or fact constellation
The aggregation level is the argument to the GROUP BY clause:
SELECT Products.Product#, SUM(Qty*Price) AS Turnover
FROM Orderdetails JOIN Products ON Orderdetails.Product# = Products.Product#
GROUP BY Products.Product#;
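Drill down to the Product per Salesman level:
SELECT Products.Product#, Salesmen.Salesman#, SUM(Qty*Price) AS Turnover
FROM Orderdetails
JOIN Products ON Orderdetails.Product# = Products.Product#
JOIN Salesmen ON Orderdetails.Salesman# = Salesmen.Salesman#
GROUP BY Products.Product#, Salesmen.Salesman#;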
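The idea that the aggregation level is simply the GROUP BY list can be tried out with a small helper. This is a sketch using Python's sqlite3 module; `turnover_by`, the renamed columns and the sample rows are all hypothetical and exist only for this example.

```python
import sqlite3

# Assumption: invented data; Price lives in the Products dimension here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE salesmen (salesman_no INTEGER PRIMARY KEY, salesman_name TEXT);
CREATE TABLE orderdetails (product_no INTEGER, salesman_no INTEGER, qty INTEGER);
INSERT INTO products VALUES (1, 'Screw', 2.0), (2, 'Bolt', 5.0);
INSERT INTO salesmen VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO orderdetails VALUES (1, 1, 10), (1, 2, 20), (2, 1, 4), (1, 1, 5);
""")

ALLOWED_LEVELS = {"product_name", "salesman_name"}

def turnover_by(*levels):
    """Aggregate turnover at the level given by the GROUP BY columns."""
    if not set(levels) <= ALLOWED_LEVELS:
        raise ValueError("unknown aggregation level")
    cols = ", ".join(levels)
    select = f"{cols}, " if levels else ""
    group = f" GROUP BY {cols} ORDER BY {cols}" if levels else ""
    sql = (f"SELECT {select}SUM(f.qty * p.price) AS turnover "
           "FROM orderdetails f "
           "JOIN products p ON p.product_no = f.product_no "
           "JOIN salesmen s ON s.salesman_no = f.salesman_no" + group)
    return conn.execute(sql).fetchall()

per_product = turnover_by("product_name")                   # product level
drilled_down = turnover_by("product_name", "salesman_name") # add an argument
top_level = turnover_by()                                   # remove all arguments

print(per_product, drilled_down, top_level)
```

Adding a column to the call drills down; removing one rolls up, and with no columns left the query returns the single top-level total.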
Where should the Price be stored: in the Products dimension or in the
Orderdetails fact table?
Snowflake schema with branches:
A Snowflake schema may have branches in the dimension hierarchies:
Fact table Orderdetails: Product#, Order#, Qty
Dimension hierarchy:
Orders: Order#, Customer#, Date
Customers: Customer#, Customer-name
Dimension hierarchy:
Products: Product#, Product-name, Price, Group#
Product groups: Group#, Group-name, Department#
Departments: Department#, Department-name
Snowflake hierarchy:
Salesmen: Salesman#, Salesman-name, Branch-office#
Branch offices: Branch-office#, Region#
Regions: Region#, Region-name
Are Customers related to the Regions?
Drilling in dimension hierarchies:
Example in the Salesmen -> Branch offices hierarchy of the snowflake schema:

Salesman level:
Salesman#  Turnover  Branch-office#
Smith      100,000   LA
Jones      300,000   LA
Adams      200,000   SF

Branch office level:
Branch-office#  Turnover
LA              400,000
SF              200,000
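The roll-up from the salesman level to the branch office level can be reproduced in a few lines of Python. The rows are the ones shown on the slide; the function name `roll_up_to_branch` is made up for this sketch.

```python
from collections import defaultdict

# Salesman-level rows from the slide: (salesman, turnover, branch office).
salesman_turnover = [
    ("Smith", 100_000, "LA"),
    ("Jones", 300_000, "LA"),
    ("Adams", 200_000, "SF"),
]

def roll_up_to_branch(rows):
    """Sum the turnover of all salesmen belonging to the same branch office."""
    totals = defaultdict(int)
    for _salesman, turnover, branch in rows:
        totals[branch] += turnover
    return dict(totals)

branch_turnover = roll_up_to_branch(salesman_turnover)
print(branch_turnover)  # {'LA': 400000, 'SF': 200000}
```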
Drilling between dimension hierarchies:
Example using the Products and Salesmen dimensions of the snowflake schema:

Product per salesman level:
Salesman#  Productname  Turnover  Branch-office#
Smith      Screw        10,000    LA
Smith      Bolt         30,000    LA
Smith      Nut          60,000    LA
Jones      Screw        20,000    SF
Jones      Nut          40,000    SF
...

Salesman level:
Salesman#  Turnover  Branch-office#
Smith      100,000   LA
Jones      300,000   LA
Adams      200,000   SF
Roll up to the top level:
Roll up can be executed by removing one or more arguments from
the GROUP BY clause.
The starting level is the turnover per Salesman# and Productname
(Smith/Screw 10,000, Smith/Bolt 30,000, Smith/Nut 60,000, Jones/Screw 20,000,
Jones/Nut 40,000, ...).

Roll up to the product level:
Productname  Turnover
Screw        100,000
Bolt         200,000
Nut          300,000

Roll up to the top level:
Turnover
600,000
The aggregation level is the argument to the GROUP BY clause.
In this version of the star schema the Salesmen dimension also holds the
branch office:
Dimension Salesmen: Salesman#, Salesman-name, Branch-Office#
Aggregated data: turnover per Salesman# and Productname, with Branch-office#
(Smith/Screw 10,000 LA, Smith/Bolt 30,000 LA, Smith/Nut 60,000 LA,
Jones/Screw 20,000 SF, Jones/Nut 40,000 SF, ...)
Non-aggregated data: x1, x2, …, xn
Dimension hierarchies and snowflake schemas:
In contrast to a star schema, a snowflake schema may have dimension
hierarchies.
Describe the advantages and disadvantages of using dimension hierarchies,
i.e. a snowflake schema.
Exercise:
The figure illustrates an ER-diagram of a car rental company like
Hertz or Avis.
Entities in the ER-diagram: Customers, Branch offices, Orders, Contracts,
Pick up, Reservations, Car return, Cars, Car types, Garage services, Garages
Question 1.
Design a star schema or galaxy for the car rental company.
Question 2.
Are there advantages in storing suppliers as customers in e.g. an
e-commerce data warehouse?
Data Models
– Relational models/ER-diagram used for OLTP databases
– Stars, snowflakes and galaxies used for OLAP databases
– Cubes used for OLAP databases
A star schema DW can be illustrated as a multidimensional cube:
Describe the advantages and disadvantages of storing the data in a cube in memory.
OLAP cube operations:
Roll Up = Aggregating to a higher level (for example from month to year).
Drill Down = Recalculation with more details.
Slice = Selecting a subset by using a fixed dimension value.
Drill Across = Join of fact data across conformed dimensions.
Drill Through = Accessing related data from an OLTP system.
Pivoting = Transforming SQL query output to a user-friendly two-dimensional
screen layout.
Fact table view (sale):
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube (amt per prodId and storeId, one layer per day):
day 1   c1   c2   c3
p1      12        50
p2      11   8
day 2   c1   c2   c3
p1      44   4

Rolled up over date:
        c1   c2   c3
p1      56   4    50
p2      11   8
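A minimal Python sketch of the step from the fact table view to the two-dimensional layout: it rolls the `sale` rows up over date and pivots them into a prodId by storeId grid. The data is the slide's; the function name `pivot` and the choice to show empty cells as `None` are assumptions of this example.

```python
from collections import defaultdict

# Fact rows from the "sale" table: (prodId, storeId, date, amt).
sale = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def pivot(rows):
    """Roll up over date and lay the result out as prodId x storeId."""
    cells = defaultdict(int)
    for prod, store, _date, amt in rows:
        cells[(prod, store)] += amt
    prods = sorted({r[0] for r in rows})
    stores = sorted({r[1] for r in rows})
    # Cells without any sale are shown as None rather than 0.
    return {p: {s: cells.get((p, s)) for s in stores} for p in prods}

table = pivot(sale)
for prod, by_store in table.items():
    print(prod, by_store)
```

The cell for p1 and c1 is 56, the sum of the day 1 and day 2 amounts (12 + 44), matching the rolled-up cube.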
OLAP Server Architectures
• Relational OLAP (ROLAP)
– Use relational or extended-relational DBMS to store and manage
warehouse data
– Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools and services
– greater scalability
• Multidimensional OLAP (MOLAP)
– Array-based multidimensional storage engine (sparse matrix techniques)
– fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP)
– Storage flexibility with mix of ROLAP and MOLAP
• POLAP: personal HOLAP
Data warehouse design/implementation architectures:
1. Kimball has a bottom-up architecture
2. Inmon has a top-down architecture
3. Data Vault architecture is normalized tables extended with historic
data tables. That is, the Data Vault can be used to generate any data
mart when needed.
Kimball’s Bottom-Up DW architecture:
Kimball’s architecture uses conformed dimensions and conformed facts.
Conformed dimensions make it possible to drill across from one data
mart to another to present data from different marts in the same view.
Only the conformed data have top-down design.
Kimball’s Data Warehousing Architecture
ETL side:
- Data sources
- Data Staging Area: Extract, Transform, Load
Presentation servers:
- Data marts with aggregate-only data
- Data marts with atomic data
- Data Warehouse Bus: conformed dimensions and facts
Query side:
- Query Services: Warehouse Browsing, Access and Security, Query Management,
  Standard Reporting, Activity Monitor
- Reporting Tools
- Desktop Data Access Tools
- Data mining
Metadata
Surrogate key = A sequence number used as primary key.
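A small sketch of how surrogate keys might be handed out while loading a dimension. The class `SurrogateKeyGenerator` and the (source system, source key) pairs are hypothetical; the slides only define the concept.

```python
import itertools

class SurrogateKeyGenerator:
    """Hand out one sequence number per distinct natural (source) key."""

    def __init__(self):
        self._next = itertools.count(1)
        self._by_natural_key = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key if this source key was seen before.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next)
        return self._by_natural_key[natural_key]

keys = SurrogateKeyGenerator()
first = keys.key_for(("ERP", "CUST-0042"))    # hypothetical source keys
second = keys.key_for(("WEBSHOP", "17"))
again = keys.key_for(("ERP", "CUST-0042"))
print(first, second, again)  # 1 2 1
```

The warehouse key stays a plain sequence number regardless of how each source system formats its own keys.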
William Inmon’s data warehouse architecture from 1990
has top-down design without conformed data.
The figure shows:
EDW = Enterprise Data Warehouse
and department data warehouses.
The DSA (Data Staging Area), where transformation takes
place, is not illustrated.
The Data Vault architecture from 2002-2005 has
full top-down design and bottom-up implementation:
Normalized Data Vault
with historic data
In the Data Vault database with historic information, only the Extract
activity has taken place.
Therefore, the Data Vault architecture does not get bogged down in the design phase.
Classical data warehousing
Stages: 1 Source (OLTP) -> 2 DSA -> 3 EDW -> DM
Activities between the stages: Extraction, Delta Detection, Cleansing,
Transformation, Business Rules, Filter, Aggregate, Error handling
DSA = Data Staging Area
EDW = Enterprise Data Warehouse
DM = Data Mart

In-memory data warehousing
HANA from SAP is an in-memory data warehouse product.
Stage: 1 Source (OLTP) only. The figure draws the remaining labels around
this single stage: Delta Detection, Cleansing, Transformation, DSA,
Business Rules, Error handling, Aggregate, Filter, Extraction
How can OLTP and OLAP be integrated in a common in-memory database?
Exercise: Transform the OLTP health records database to a star schema DW
for a hospital.
ER-diagram for a hospital.
Basic health records are above the dotted line:
- Patients: Patient ID, Name, Address
- Health record subtypes: Patient admits, Symptoms and test results,
  Diagnoses/diseases, Prescriptions, Prescription lines, Treatments,
  Patient discharges, ...
Conceptual hospital entities in general are below the dotted line:
- Patient admit type, Employees, Symptom types, Disease types,
  Treatment types, Medicine types, Patient discharge types, ...
Exercise:
Design an airline DW.
Entities in the figure: Airline companies, Airports, Flight routes, Subroutes,
Departures, Tickets, Customers, Travel arrangement
Exercise:
Design a hotel DW.
Entities in the figure: Hotel chains, Hotels, Rooms, Room reservations,
Services/tours/car rentals, Customer groups, Check-in periods, Customers
Exercise:
Design a data warehouse for a travel agency.
Labels in the figure: Customers, Buyer, Traveler, Orders, Bookings, Reservations,
Flight routes/Room types/Car types/service types,
Departures/Hotel rooms/Car rentals/etc., Product owners
End of session
Thank you !!!
Inmon versus Kimball’s DW definitions:
Kimball and Inmon agree that OLAP data warehouses do not use the
OLTP databases. However, what is the difference between the architectures?
Why do you think Kimball’s DW architecture is the one used most in practice?
Dates may be stored in different formats. As an example, the First purchase
date may be stored as an FK to a hierarchical time dimension and the Birth
date as an SQL timestamp.
Why are different date formats used in the Customer table?
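The two storage choices can be put side by side in a short Python sketch. The `yyyymmdd` encoding of Date#, the function names and the customer values are assumptions for illustration; the slides do not specify how Date# is built.

```python
from datetime import date

def date_key(d):
    """Hypothetical Date# encoding: the integer yyyymmdd."""
    return d.year * 10000 + d.month * 100 + d.day

def time_dimension_row(d):
    """One Day-level row of a hierarchical time dimension (Date#, yy, mm, dd)."""
    return {"date_no": date_key(d), "yy": d.year, "mm": d.month, "dd": d.day}

first_purchase = date(2023, 11, 5)  # invented value

customer = {
    "customer_no": 1,
    # FK into the time dimension: can be rolled up to month and year.
    "first_purchase_date_no": date_key(first_purchase),
    # Plain timestamp-style value: a descriptive attribute, not a drill path.
    "birth_date": date(1980, 2, 29).isoformat(),
}

time_dim = {date_key(first_purchase): time_dimension_row(first_purchase)}
row = time_dim[customer["first_purchase_date_no"]]
print(customer, row)
```

Only the foreign-key date takes part in the time hierarchy, so only that date supports roll-up and drill-down along Day, Month and Year.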
OLAP
• OLAP = On-Line Analytical Processing
– Interactive analysis
– Explorative discovery
– Requires fast response times
• Data can be viewed as multidimensional cubes
– Cubes can have an arbitrary number of dimensions
– Dimensions have hierarchies, e.g. day-month-year
• OLAP operations
– Aggregation = summarizing data, e.g. with SUM, AVG, COUNT…
– Starting level, (Quarter, Product)
– Roll Up: less detail, Quarter->Year
– Drill Down: more detail, Quarter->Month
– Slice: projection/selection, Year=1999
– Drill Across: “join” on shared dimensions
– Drill Through: looking up the source data in the operational systems
– Pivoting
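Two of the operations listed above, Roll Up (Quarter->Year) and Slice (Year=1999), can be sketched in plain Python. The cells at the (Quarter, Product) starting level are invented, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical cells at the starting level (Quarter, Product):
# ((year, quarter), product, amount)
cells = [
    ((1998, 4), "Screw", 10), ((1999, 1), "Screw", 20),
    ((1999, 2), "Screw", 30), ((1999, 2), "Bolt", 5),
]

def roll_up_to_year(rows):
    """Roll Up: less detail, Quarter -> Year."""
    totals = defaultdict(int)
    for (year, _quarter), product, amount in rows:
        totals[(year, product)] += amount
    return dict(totals)

def slice_year(rows, year):
    """Slice: keep only the cells with a fixed value in the time dimension."""
    return [row for row in rows if row[0][0] == year]

by_year = roll_up_to_year(cells)
only_1999 = slice_year(cells, 1999)
print(by_year)
print(only_1999)
```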
The Business Dimensional Lifecycle =
Kimball’s activity model for data warehouse development has
three parallel tracks:
Project planning -> Requirements specification -> three parallel tracks ->
Deployment -> Maintenance and growth
The three tracks:
- Technical architecture design -> Product selection and installation
- Dimensional modeling -> Physical design -> ETL: design and development
- Application specification -> Application development
Project management runs alongside all activities.
The Data Warehouse Bus Architecture =
Architecture for designing a series of data marts that together make up
the company's data warehouse, with shared conformed dimensions
and conformed facts.
Data marts = department data warehouses. Kimball uses the word
more generally for a single multidimensional database.
Conformed dimensions = Shared dimensions that are adapted to the
requirements of several data marts.
Stovepipe =
Derogatory term for a data warehouse without conformed dimensions.
Kimball’s data warehouse concepts:
The figure repeats Kimball’s data warehousing architecture (ETL side, Data
Staging Area, presentation servers with data marts on the Data Warehouse Bus,
query side, metadata), here with the data sources labelled as operational
systems. Further labels in the figure: Data, Service, Element.
Inmon does not use the conformed facts and dimension table concepts!
In the Data Vault architecture the data marts are
loaded from a normalized database with historic
information.
Existing databases and systems (OLTP): applications (Appl.), each with its
own database (DB)
-> ETL -> Data Vault
-> New databases and systems (OLAP): data marts (DM) used for OLAP,
data mining and visualization
In the future the Data Vault may be the only
database and be stored in memory.
SAP has already developed an
in-memory OLAP database called HANA.