Introduction to Data Warehousing

Pasquale LOPS
Gestione della Conoscenza d’Impresa
A.A. 2003-2004
Introduction
- Data warehousing and decision support have given rise to a new class of databases.
- Design strategies for OLAP databases differ significantly from those for OLTP systems.
- Today’s decision support systems must deliver multidimensional analysis capabilities.
Terminology – What is a Data Warehouse?
A database
- Typically read-only
- Data stored in relational or multidimensional format
- Multidimensional database often populated from a relational database
Populated from existing source systems
- Secondary source of data
- Populated from existing internal or external data sources
- It is also possible to build a DSS directly on top of operational systems
Used primarily for reporting purposes
- Not transaction-based
Must be designed for analysis purposes
- OLAP, not OLTP
Terminology – Decision Support and Multidimensional Analysis
Decision Support Systems (DSS)
- Facilitate business analysis
- Support business decision makers by providing various types of analysis: trend, comparison and ad hoc reporting
Multidimensional Analysis
- Allows analysis by business dimension
- Equates to ‘Flexible Reporting’
- Allows for drill down, drill up and iterative data analysis
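As a rough illustration of drill up and drill down, the sketch below sums a tiny sales list at two levels of a hypothetical region > store hierarchy. The data, the `roll_up` helper and the region names are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, store, units sold).
SALES = [
    ("East", "Boston", 10),
    ("East", "Philly", 5),
    ("West", "Dallas", 7),
    ("West", "Las Vegas", 3),
]

def roll_up(rows, key_fields):
    """Sum units grouped by the chosen dimension columns (0 = region, 1 = store)."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_fields)
        totals[key] += row[2]
    return dict(totals)

# Drill up: analyze at the coarser region level.
by_region = roll_up(SALES, key_fields=(0,))
# Drill down: break the same totals out by store within each region.
by_store = roll_up(SALES, key_fields=(0, 1))
```

The same fact rows answer both questions; only the grouping dimension changes, which is what makes the analysis iterative.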
Terminology – OLTP and OLAP
OLTP
On-Line Transaction Processing
- Supports a specific application
- Maintains integrity of data
OLAP
On-Line Analytical Processing
- Supports business analysis
Points of Difference
- Orientation or alignment of data
- Integration
- History: time horizon of data
- Data access and manipulation
- Usage patterns
OLTP vs. OLAP – Orientation or Alignment of Data
OLTP: Organized around Applications
- Different systems hold different types of data.
- Data is inherently organized by application.
- Different information is held in a different system.
OLAP: Organized for Business Dimension
- All types of data are integrated into one system.
- Data is organized by defined dimensions of the business.
- Information from different systems is stored in a single database.
OLTP vs. OLAP – Integration
OLTP: Typically Not Integrated
- Different key structures
- Different naming conventions
- Different file formats
- Different hardware platforms
OLAP: Must Be Integrated
- Standard key structures
- Standard naming conventions
- Standard file format
- One warehouse server
  - Logical server
OLTP vs. OLAP – History
OLTP: Recent or Current Data
- 60-90 days
- Current values only
- No time key
- No time series analysis
- Primary source
OLAP: Historical Data
- 2 or more years
- Historical snapshots of OLTP data
- Time key
- Time series analysis
- Secondary source
  - Until data is purged or lost from OLTP (after 60-90 days)
OLTP vs. OLAP – Data Access and Manipulation
OLTP: Transactions
- Inserts, Updates, Deletes, Selects
- Small amount of data involved in each transaction
- Highly ‘indexable’
- RDBMS focus
  - Locking
  - Concurrency
  - Logical Unit of Work
OLAP: Bulk Processes
- Selects only
- Large amount of data involved in each process
- Not always ‘indexable’
- RDBMS focus
  - Parallel Loader, Query
  - Star Join
  - Bit-mapped Indexes
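To make the contrast concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `account` table and its values are invented for illustration. The first part is an OLTP-style logical unit of work touching two rows, and the second is an OLAP-style read-only query that scans the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

# OLTP style: a small transaction, committed as one logical unit of work.
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")

# OLAP style: select only, reading every row to produce a summary.
total = conn.execute("SELECT SUM(balance) FROM account").fetchone()[0]
balances = conn.execute("SELECT balance FROM account ORDER BY id").fetchall()
```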
OLTP vs. OLAP – Usage Patterns
OLTP: Fairly Consistent
- Maintains a constant system utilization pattern
OLAP: Spiked or Uneven
- Long periods of light use punctuated by spikes in usage
[Figure: System Resource Utilization Graphs]
OLTP vs. OLAP – Summary
              OLTP                       OLAP
Alignment:    Aligned by Application     Aligned by Dimension
Integration:  Typically Not Integrated   Must Be Integrated
History:      Recent or Current Data     Historical Data
Data Access:  Transactions               Bulk Processes
Usage:        Fairly Consistent          Spiked or Uneven
Warehouse Headaches: Batch, Maintenance, Tuning
Intro to ERM and ERD
Terms and Concepts
ENTITIES
ERM Terminology
ERM - Entity Relationship Model (design)
ERD - Entity Relationship Diagram (graphical)
Entity - things of interest to the business, represented by boxes and implemented as tables
Attributes - things to know about an entity, implemented as columns in tables
Relationships - how entities relate, represented by lines on the ERD and implemented as foreign keys
Entity Paradigms
- Rounded corners: ERD
- Square corners: Relational
Naming
- Should be singular in nature
- Consistency, communication, compatibility
RELATIONSHIPS
Relationships and Business Rules
[Diagram: entities RX and RX Transaction joined by a relationship line]
Relationship - The line and “crow’s foot” represent a foreign key relationship from RX Transaction to RX
Cardinality - A crow’s foot means “one or more”; its absence means “one”
Relationships and Business Rules
[Diagram: RX “allows” RX Transaction; RX Transaction “is allowed by” RX]
Optionality
- Solid bar means that the relationship MUST exist
- Circle means that the relationship MAY exist
- Use the words near the entities, together with the optionality symbols, to complete sentences for the definition:
  - RX may allow RX Transactions
  - RX Transactions must be allowed by an RX
Relationships
[Diagram: Department “is managed by” Vice President; Vice President “manages” Department]
One-to-One relationship:
Each department must be managed by one VP.
Each VP may manage one department.
Relationships
[Diagram: Management Team “contains” Vice President; Vice President “is contained in” Management Team]
One-to-Many relationship:
Each management team must contain many VPs.
Each VP may be contained in one management team.
Relationships
[Diagram: Employee “holds” Degree; Degree “is held by” Employee]
Many-to-Many relationship:
Each Employee must hold one or more degrees.
Each Degree may be held by one or more employees.
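A many-to-many relationship cannot be carried by a single foreign key column. A common way to implement it, sketched below with Python's `sqlite3`, is a link table holding one row per Employee/Degree pair. The table name `Employee_Degree`, the column names and the sample rows are assumptions for illustration; the slides do not prescribe them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE Employee (Employee_id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Degree   (Degree_id   INTEGER PRIMARY KEY, Title TEXT);
    -- Hypothetical link table: one row per (employee, degree) pair.
    CREATE TABLE Employee_Degree (
        Employee_id INTEGER NOT NULL REFERENCES Employee(Employee_id),
        Degree_id   INTEGER NOT NULL REFERENCES Degree(Degree_id),
        PRIMARY KEY (Employee_id, Degree_id)
    );
    INSERT INTO Employee VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO Degree   VALUES (10, 'BSc'), (20, 'MSc');
    INSERT INTO Employee_Degree VALUES (1, 10), (1, 20), (2, 10);
    """
)

# Degrees held by employee 1, and employees holding degree 10.
degrees_of_ada = [r[0] for r in conn.execute(
    "SELECT Degree_id FROM Employee_Degree WHERE Employee_id = 1 ORDER BY Degree_id")]
holders_of_bsc = [r[0] for r in conn.execute(
    "SELECT Employee_id FROM Employee_Degree WHERE Degree_id = 10 ORDER BY Employee_id")]
```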
ATTRIBUTES
Attributes - Terminology
Attributes are the information we wish to keep about a particular entity
Example: Inventory (Store_id, Item_id, Amt, Units)
Attributes - Implementation
Entities and their attributes are shown as:
ENTITY NAME (Attribute1, Attribute2, Attribute3)
Inventory (Store_id, Item_id, Amount, Units)
To specify a primary key for an entity/table, underline the appropriate Attribute(s)
DRUG (Store_id, Item_id, Amount, Units)
For the purposes of normalization, repeating groups of attributes may be shown in braces
SALES (ITEM_ID, DATE, SALES_AMT, {ITEM_NAME, CLASS})
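The mapping from entity to table can be sketched with Python's `sqlite3` as follows. The choice of (Store_id, Item_id) as the primary key and the column types are assumptions made for this example, since the notation above does not fix them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Entity -> table, attributes -> columns.
# Assumption: (Store_id, Item_id) together identify an Inventory row.
conn.execute(
    """
    CREATE TABLE Inventory (
        Store_id INTEGER NOT NULL,
        Item_id  INTEGER NOT NULL,
        Amount   REAL,
        Units    INTEGER,
        PRIMARY KEY (Store_id, Item_id)
    )
    """
)
conn.execute("INSERT INTO Inventory VALUES (13, 100, 19.99, 4)")

# A second row with the same key is rejected by the primary key constraint.
try:
    conn.execute("INSERT INTO Inventory VALUES (13, 100, 5.00, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Inventory").fetchone()[0]
```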
Warehouse Architecture Overview
Warehouse Overview
Basic components, tied together by Design Strategies
- Source Systems
- Warehouse Server
- Warehouse Access Tool
Warehouse Overview
Designers must consider and understand the unique characteristics and requirements of all three previous components.
Ideally, a project team should pick the best-of-class tools for storing and accessing data.
In reality, all three pieces should be selected with regard to the others, to ensure that each component will complement the others.
Source Systems
One or more operational systems will be the source(s) of the data stored in the data warehouse.
[Diagram: Source Systems A, B and C; User Groups A, B and C]
Source Systems (cont’d)
Source systems are typically not integrated.
- Have unique key structures and unique naming conventions
- Possess overlapping data
Source systems hold current value data.
Source systems will indirectly define the scope of a warehouse.
- Only data found in source systems can be included in the data warehouse; no “new” data can be created.
- Each operational system will have unique characteristics (levels of detail or granularity of data, types of data or metrics available)
Warehouse Server
[Diagram: Distributed Architecture A and Distributed Architecture B, each combining several DWH RDBMS instances; labels include Gateway and HW Platform (typically UNIX-based)]
Warehouse Access Tool / Architecture
[Diagram: MOLAP (Client, MDDB calls, MDDB, SQL, DWH RDBMS) and ROLAP, 2-3 tier (Client, messaging or SQL, App Server, SQL, DWH RDBMS), each tier on its own HW Platform]
Design Overview
The Warehouse Trade-Off Triangle
[Diagram: triangle with corners Query Performance, Data Warehouse Maintenance and User Requirements, with Schema at the center]
The ETL Process
ETL = Extraction, Transformation and Loading
Batch Process – Overview
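[Diagram: on the Source System Server, the Source System feeds an Extract Program that writes an Extract File; a File Transfer moves it to the Landing Space on the Warehouse Server as a Load File, which is loaded into the DWH RDBMS]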
Batch Process – Extracts
Extracts are programs that generate data files.
- Perform data transformations, data cleaning.
- Perform key conversions.
- Reformat data to the standards of the warehouse.
- Must produce data in a file format suitable for loading into the data warehouse (delimiters, capitalization, etc.).
- May build aggregation tables.
[Diagram: Source System -> Extract Program -> Extract File, all on the Source System Server]
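A minimal sketch of such an extract program follows. The source codes in `KEY_MAP`, the field names and the pipe-delimited output format are all hypothetical, chosen only to show cleaning, key conversion and reformatting in one pass.

```python
import csv
import io

# Hypothetical mapping from source-system store codes to warehouse keys.
KEY_MAP = {"BOS-01": 21, "DAL-07": 24}

def extract(source_rows):
    """Clean and reformat source rows into a pipe-delimited load file (as text)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    for row in source_rows:
        code = row["store"].strip().upper()      # data cleaning
        if code not in KEY_MAP:                  # skip rows that cannot be keyed
            continue
        writer.writerow([
            KEY_MAP[code],                       # key conversion
            row["item"].strip().upper(),         # capitalization standard
            f"{float(row['amount']):.2f}",       # uniform numeric format
        ])
    return out.getvalue()

source = [
    {"store": " bos-01 ", "item": "aspirin", "amount": "12.5"},
    {"store": "DAL-07", "item": "Bandage ", "amount": "3"},
    {"store": "???", "item": "unknown", "amount": "1"},
]
load_file = extract(source)
```

In a real batch process the rejected row would normally be written to an error file rather than silently dropped.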
Batch Process – Extracts
Basic Types of Extracts
1) Fact tables
Must provide load files for the following tables:
- Base tables
- Historical tables
- Aggregate tables
2) Lookup tables
Must provide data to populate the following tables:
- Lookup tables
- Relationship tables
Batch Process – Extracts (cont’d)
- Static extraction: used for the first loading of the DWH
- Incremental extraction: used for subsequent updates of the DWH
Batch Process – File Transfers
[Diagram: File Transfer from the Extract File on the Source System Server to the Load File in the Landing Space on the Warehouse Server]
Batch Process – File Transfers (cont’d)
File Transfer: the process of moving data files to the data warehouse server.
After extracts are run, the generated files must be moved from the source systems to the data warehouse server.
Design Considerations
- Transfer method and network impact
  - Usually transferred via FTP
  - Data volumes are usually large; therefore, the impact on the network and the transfer rate should be tested and understood
  - The landing space on the data warehouse server must have enough disk space to temporarily store extract files before they are loaded into the warehouse
Batch Process – File Transfers (cont’d)
Design Considerations
- Scheduling routines
  - If there is not enough landing space, scheduling routines must be designed.
  - Routines must coordinate file transfers and database loads, transferring a new data file only after an existing file has been loaded and is no longer needed.
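One possible shape for such a scheduling routine is sketched below. The `schedule` function, the file names and the sizes are hypothetical; the point is only that a transfer happens when the file fits in the free landing space, and otherwise a landed file is loaded first to reclaim space.

```python
from collections import deque

def schedule(extract_files, landing_capacity):
    """Return an ordered list of ('transfer' | 'load', name) steps.

    A file is transferred only when it fits in the free landing space;
    otherwise already-landed files are loaded (and removed) first.
    """
    pending = deque(extract_files)   # (name, size) pairs on the source server
    landed = deque()                 # files sitting in the landing space
    free = landing_capacity
    steps = []
    while pending or landed:
        if pending and pending[0][1] <= free:
            name, size = pending.popleft()
            landed.append((name, size))
            free -= size
            steps.append(("transfer", name))
        elif landed:
            name, size = landed.popleft()
            free += size             # space is reclaimed once the file is loaded
            steps.append(("load", name))
        else:
            raise ValueError(f"{pending[0][0]} exceeds the landing space")
    return steps

steps = schedule([("a.dat", 60), ("b.dat", 30), ("c.dat", 50)], landing_capacity=100)
```

Here `c.dat` has to wait until `a.dat` has been loaded, because only 10 units of landing space are free after the first two transfers.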
Batch Process – Data Loads
[Diagram: batch process flow from the Source System through Extract Program, Extract File, File Transfer and Load File (Landing Space) into the DWH RDBMS]
Batch Process – Data Loads (cont’d)
Data Load: data is loaded from the extract file into the database, followed by:
- Post-load processes
- Aggregation routines
Basic Types of Load Procedures
- Append new records to existing table.
- Drop table and reload updated data file.
- Update existing records.
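The three load procedures can be sketched with Python's `sqlite3` as shown below. The `lookup_store` table and the helper names are assumptions for illustration, and a production warehouse would normally use the RDBMS bulk loader rather than row-by-row inserts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup_store (store_id INTEGER PRIMARY KEY, store_desc TEXT)")

def append(rows):
    """Append new records to the existing table."""
    conn.executemany("INSERT INTO lookup_store VALUES (?, ?)", rows)

def drop_and_reload(rows):
    """Empty the table, then reload the full, updated data file."""
    conn.execute("DELETE FROM lookup_store")
    conn.executemany("INSERT INTO lookup_store VALUES (?, ?)", rows)

def update(rows):
    """Update existing records in place, matched on the key."""
    conn.executemany(
        "UPDATE lookup_store SET store_desc = ? WHERE store_id = ?",
        [(desc, sid) for sid, desc in rows],
    )

append([(21, "Boston"), (24, "Dalas")])
update([(24, "Dallas")])  # correct a misspelled description
after_update = conn.execute("SELECT * FROM lookup_store ORDER BY store_id").fetchall()
drop_and_reload([(13, "San Fran"), (21, "Boston"), (24, "Dallas")])
after_reload = conn.execute("SELECT COUNT(*) FROM lookup_store").fetchone()[0]
```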
Batch Process – Data Loads (cont’d)
Post Load Processes
- Must update table indexes after data loads.
- For statistics-based optimizers, must update table and index statistics after loads.
Aggregation Routines
- Must run aggregation routines if aggregate data preparation is performed in the data warehouse database.
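A small sketch of the post-load steps, using SQLite through Python's `sqlite3` as a stand-in for the warehouse RDBMS. The table, index name and data are invented. Building the index only after the bulk load is one common approach and is an assumption here, not something the slides mandate. `ANALYZE` is SQLite's command for refreshing optimizer statistics, and other products use different commands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (store_id INTEGER, week_id INTEGER, sales_units INTEGER)"
)

# Bulk load without the index in place (hypothetical data).
rows = [(store, week, store + week) for store in range(1, 51) for week in range(1, 21)]
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)

# Post-load step 1: build the index once, after the load.
conn.execute("CREATE INDEX idx_sales_store ON fact_sales (store_id)")
# Post-load step 2: refresh table and index statistics for the optimizer.
conn.execute("ANALYZE")
conn.commit()

# SQLite stores the gathered statistics in the sqlite_stat1 table.
stats = conn.execute("SELECT tbl, idx FROM sqlite_stat1").fetchall()
```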
Batch Jobs – Overview
Basic Refresh Jobs: necessary to update tables in the warehouse with current information
- Lookup Table
- Fact Table
- Aggregate Table
Maintenance Jobs: necessary to maintain tables in the warehouse
- Updating fact data
- Re-organizing data
Basic Refresh Jobs – Lookup Tables
Purpose
- Apply changes in existing “organizational” systems to lookup data in the data warehouse.
- Changes include addition of new items or changes to descriptive information.
- No changes to attribute keys or attribute relationships.
Basic Refresh Jobs – Lookup Tables (cont’d)
Basic Methods
- No refresh
  - Extract is run once to populate the DWH
  - Often used in pilot or prototype systems
- Drop and reload
  - Existing table is dropped or emptied
  - Extract is re-run to capture current information
  - Table is loaded with the new extract
- Append to existing table
  - Extract is re-run to capture current information
  - New extract and “old” or “master” lookup file are compared
  - New “delta” file is generated
  - Delta is applied to the master lookup file and to the lookup table in the warehouse: either the delta file is loaded into the warehouse directly, or the delta is applied to the master lookup file, which is then loaded using the Drop and Reload method
  - Sophisticated batch routines, normally used in production
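The compare step of the append method might look like the sketch below, with the master lookup and the new extract held as dictionaries keyed by lookup id. The function names and sample rows are hypothetical. Deletions are not handled, which matches the stated purpose: only new items and changed descriptions are expected.

```python
def build_delta(master, new_extract):
    """Compare the master lookup with a new extract, both keyed by id.

    Returns the rows that are new or whose description changed.
    """
    return {
        key: desc
        for key, desc in new_extract.items()
        if master.get(key) != desc
    }

def apply_delta(master, delta):
    """Apply the delta to the master lookup (insert new keys, update changed ones)."""
    updated = dict(master)
    updated.update(delta)
    return updated

master = {21: "Boston", 24: "Dallas"}                     # e.g. built from extract 1/96
new_extract = {21: "Boston", 24: "Dallas TX", 35: "DC"}   # e.g. extract 2/96
delta = build_delta(master, new_extract)
master = apply_delta(master, delta)
```

Only the two delta rows need to travel to the warehouse, instead of the full lookup file.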
Basic Refresh Jobs – Lookup Tables (cont’d)
[Diagram: the Extract Program reads the Org Source System and writes Extract File 1/96 and Extract File 2/96, which populate the Lookup Table in the DWH RDBMS]
Basic Refresh Jobs – Lookup Tables (cont’d)
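[Diagram: the Extract Program reads the Org Source System and writes Extract File 2/96; a Compare Program compares it with the Master Lookup and produces a Delta File, which updates the Lookup Table in the DWH RDBMS]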
Basic Refresh Jobs – Fact Tables
Purpose
- Refresh or update fact data in the DWH with the new data from source systems.
Basic methods
- Bulk or historical insert
  - Extract is run to capture all data existing in source systems
  - Data is bulk-loaded into data warehouse fact tables
  - Simple batch routine
  - Used to “start” the warehouse or provide initial data sets
  - Often doesn’t perform any cleansing or integration
- Drop and reload
- Append to existing table
Basic Refresh Jobs – Fact Tables (cont’d)
[Diagram: the Extract Program reads the Fact Source System and writes an Extract File for 9/95 thru 1/96, which is loaded into the Fact Table in the DWH RDBMS]
Basic Refresh Jobs – Fact Tables (cont’d)
Basic methods
- Drop and reload
  - Historical or bulk extract is re-run to capture all available data
  - Existing warehouse table is emptied or truncated
  - File is inserted into the empty fact table
  - Simple batch routine
  - Used in prototypes or pilots
  - Not feasible for large data sets
- Append to existing table
  - Extract is re-run to capture current information
  - New extract is added to the “end” of the existing fact table
  - Most common method used in production systems
Basic Refresh Jobs – Fact Tables (cont’d)
[Diagram: the Fact Table in the DWH RDBMS holds 9/95 - 1/96; the Extract Program reads the Fact Source System and writes Extract File 2/96, which is appended to the Fact Table]
Basic Refresh Jobs – Aggregate tables
Purpose
- Refresh or update aggregate tables
Basic methods
- Aggregate in warehouse RDBMS
  - Produce atomic extract
  - Transfer and load atomic extract into atomic fact table
  - Produce aggregate values using SQL accessing atomic fact table
  - Insert aggregate values into aggregate fact table
- Aggregate in batch (on source systems or warehouse server)
  - Produce atomic extract
  - Transfer and load atomic extract into atomic fact table
  - Produce aggregate extract from atomic extract
  - Transfer and load aggregate extract into aggregate fact table
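The "aggregate in warehouse RDBMS" method boils down to an `INSERT ... SELECT ... GROUP BY` over the atomic fact table. The sketch below runs it in SQLite via Python's `sqlite3`. The table names, columns and rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fact_sales_daily  (store_id INTEGER, week_id INTEGER, day TEXT, sales_units INTEGER);
    CREATE TABLE fact_sales_weekly (store_id INTEGER, week_id INTEGER, sales_units INTEGER);
    INSERT INTO fact_sales_daily VALUES
        (21, 5, 'Mon', 4), (21, 5, 'Tue', 6), (24, 5, 'Mon', 3), (21, 6, 'Mon', 2);
    """
)

# Produce aggregate values with SQL over the atomic fact table
# and insert them into the aggregate fact table.
conn.execute(
    """
    INSERT INTO fact_sales_weekly (store_id, week_id, sales_units)
    SELECT store_id, week_id, SUM(sales_units)
    FROM fact_sales_daily
    GROUP BY store_id, week_id
    """
)
weekly = conn.execute(
    "SELECT * FROM fact_sales_weekly ORDER BY store_id, week_id"
).fetchall()
```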
Basic Refresh Jobs – Aggregate Tables (cont’d)
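[Diagram: on the Source System Server, the Extract Program produces the Atomic Extract and an Aggregate Program produces the Aggregate Extract; in the Warehouse RDBMS, the Atomic Extract loads the Atomic Fact Table, and the Aggregate Fact Table is fed either by the Aggregate Extract or by Aggregate SQL Routines over the Atomic Fact Table]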
Maintenance Jobs – Updating Fact Data
Purpose
- As changes are made to source system data, they should be reflected in the data warehouse.
Basic Methods
- Ignore changes.
- Wait until audited data is available.
- Drop and reload day’s extract.
- Capture and apply changes.
- Transfer changes.
Maintenance Jobs – Updating Fact Data (cont’d)
Scenario
[Diagram: weekly calendar (Sun to Sat) showing Sunday pre-audit data, Wednesday pre-audit data and Sunday post-audit data]
The audit process produces a clean data set 3 days after the initial set is posted.
Maintenance Jobs – Updating Fact Data (cont’d)
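[Diagram: the Extract Program reads the Fact Source System and produces Sunday pre-audit, Sunday post-audit and Wednesday pre-audit data sets; a Compare Program turns them into a Delta File that is applied to the Fact Table in the DWH RDBMS]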
Maintenance Jobs – Re-Organizing Data
The relationship between Region and Store changes: Dallas is moved to the East Region.
Lookup Region (Region_id, Region_desc)
Lookup Store (Store_id, Store_desc, Region_id)
Must update the foreign key in Store Lookup:
Store_id  Store_desc  Region_id
13        San Fran    2
21        Boston      1
24        Dallas      2 -> 1
27        Philly      1
35        DC          1
57        Las Vegas   2
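In this design the reorganization is a single-row update of the foreign key, as the following `sqlite3` sketch shows. The store rows come from the table above. The lowercase table names and the region description 'West' are assumptions, since the slides name only the East region.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lookup_region (region_id INTEGER PRIMARY KEY, region_desc TEXT);
    CREATE TABLE lookup_store (
        store_id   INTEGER PRIMARY KEY,
        store_desc TEXT,
        region_id  INTEGER REFERENCES lookup_region(region_id)
    );
    INSERT INTO lookup_region VALUES (1, 'East'), (2, 'West');  -- 'West' is assumed
    INSERT INTO lookup_store VALUES
        (13, 'San Fran', 2), (21, 'Boston', 1), (24, 'Dallas', 2),
        (27, 'Philly', 1), (35, 'DC', 1), (57, 'Las Vegas', 2);
    """
)

# Dallas moves to the East region: only the foreign key in the store lookup changes.
conn.execute("UPDATE lookup_store SET region_id = 1 WHERE store_id = 24")
east = [r[0] for r in conn.execute(
    "SELECT store_desc FROM lookup_store WHERE region_id = 1 ORDER BY store_id")]
```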
Maintenance Jobs – Re-Organizing Data (cont’d)
Lookup Region (Region_id, Region_desc)
Lookup Store (Region_id, Store_id, Store_desc)
Fact Sales (Region_id, Store_id, Item_id, Week_id, Sales_dollars, Sales_units)
Must update the key for Store in all tables.
Best Case
Region_id  Store_id  Store_desc
1          04        Boston
1          07        Philly
1          11        DC
2          03        San Fran
2 -> 1     08        Dallas
2          11        Las Vegas
Worst Case
Region_id  Store_id  Store_desc
1          04        Boston
1          07        Philly
1          11        DC
2          03        San Fran
2 -> 1     07        Dallas
2          11        Las Vegas
In the worst case, the new compound key for Dallas (1, 07) would duplicate the key already used by Philly.
Maintenance Jobs – Re-Organizing Data (cont’d)
Lookup Region (Region_id, Region_desc)
Lookup Store (Store_id, Store_desc, Region_id)
Region Sales (Region_id, Item_id, Date, Sales_Dollars, Sales_Units)
Store Sales (Store_id, Item_id, Date, Sales_Dollars, Sales_Units)
It is necessary to re-aggregate table values.
Lookup Store:
Store_id  Store_desc  Region_id
13        San Fran    2
21        Boston      1
24        Dallas      2 -> 1
27        Philly      1
35        DC          1
57        Las Vegas   2
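Re-aggregation can be sketched as rebuilding Region Sales from Store Sales through the current store-to-region mapping. The lowercase table names, the reduced column set and the sample sales rows below are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lookup_store (store_id INTEGER PRIMARY KEY, store_desc TEXT, region_id INTEGER);
    CREATE TABLE store_sales  (store_id INTEGER, item_id INTEGER, date TEXT, sales_units INTEGER);
    CREATE TABLE region_sales (region_id INTEGER, item_id INTEGER, date TEXT, sales_units INTEGER);
    INSERT INTO lookup_store VALUES (21, 'Boston', 1), (24, 'Dallas', 2), (57, 'Las Vegas', 2);
    INSERT INTO store_sales VALUES
        (21, 1, '1996-02-01', 10), (24, 1, '1996-02-01', 7), (57, 1, '1996-02-01', 3);
    """
)

def reaggregate():
    """Rebuild region_sales from store_sales using the current store-to-region mapping."""
    conn.execute("DELETE FROM region_sales")
    conn.execute(
        """
        INSERT INTO region_sales
        SELECT s.region_id, f.item_id, f.date, SUM(f.sales_units)
        FROM store_sales f JOIN lookup_store s ON s.store_id = f.store_id
        GROUP BY s.region_id, f.item_id, f.date
        """
    )
    return conn.execute(
        "SELECT region_id, sales_units FROM region_sales ORDER BY region_id"
    ).fetchall()

before = reaggregate()
conn.execute("UPDATE lookup_store SET region_id = 1 WHERE store_id = 24")  # Dallas moves East
after = reaggregate()
```

The store-level facts are untouched; only the region-level aggregate has to be recomputed, which is why the history of Dallas sales follows it into the East region.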
Batch Process – Frequency
- Batch job frequencies differ with data sources and level of detail
- Typically there will be a set of batch routines dedicated to each level of time detail (daily, weekly and monthly batch jobs)
- Frequency is often, but not necessarily, tied to the level of time detail included in the data files to be loaded during that batch routine
Batch Process – Frequency (cont’d)
Frequency vs. Detail Chart
                  Frequency
Level of Detail   Daily   Weekly
Daily             A       B
Weekly            C??     A
Current week problem
[Diagram: Weekly Tables, Daily Tables, Current Week]
References
Golfarelli, M., Rizzi, S., Data Warehouse: Teoria e pratica della progettazione, McGraw-Hill, 2002.