Designing the data warehouse / data marts

Methodologies and Techniques
Basic principles
Life cycle of the DW
[Figure: operational databases feed the warehouse database through a first-time load and then repeated refreshes; warehouse data is later purged or archived]
Oracle Warehouse Components
[Figure: Any Source (operational data, external data) → Any Data (relational / multidimensional; text, image; spatial; web; audio, video) → Any Access (relational tools, OLAP tools, applications / web); remaining label: Oracle Medi`]
Oracle Intelligence Tools
IS develops user’s views    Current      Oracle Reports
Business users              Tactical     Oracle Discoverer
Analysts                    Strategic    Oracle Express
Oracle Data Mart Suite
[Figure: OLTP databases and OLTP engines feed the warehousing engines and the data mart database (Oracle8, SQL*PLUS)]
• Data modeling: Oracle Data Mart Designer
• Data extraction: Oracle Data Mart Builder
• Data management: Oracle Enterprise Manager
• Data access & analysis: Discoverer & Oracle Reports
“Big Bang” Approach: Advantages and Disadvantages
• Advantages:
– Warehouse is built as part of a major project (e.g. BPR)
– Gives a “big picture” of the data warehouse before the data warehousing project starts
• Disadvantages:
– Involves high risk and takes longer
– Runs the risk of requirements changing during the project
Incremental Approach to Warehouse Development
Strategy → Definition → Analysis → Design → Build → Production
• Multiple iterations
• Shorter implementations
• Validation of each phase
Benefits of an Incremental
Approach
• Delivers a strategic data warehouse
solution through incremental development
efforts
• Provides extensible, scalable architecture
• Quickly provides business benefits and
ensures a much earlier return on
investment
• Allows a data warehouse to be built based
on a subject or application area at a time
• Allows the construction of an integrated
data mart environment
Data Mart
• A subset of a data warehouse that
supports the requirements of a
particular department or business
function.
• Characteristics include:
– Does not normally contain detailed operational
data, unlike a data warehouse
– May contain certain levels of aggregation
Dependent Data Mart
[Figure: flat files, operational systems, and external data feed the data warehouse, which in turn feeds the Marketing, Sales, Finance, and Human Resources data marts]
Independent Data Mart
[Figure: operational systems, flat files, and external data feed a Sales or Marketing data mart directly]
Reasons for Creating a Data
Mart
• To give users more flexible access to
the data they need to analyse most
often.
• To provide data in a form that matches
the collective view of a group of users
• To improve end-user response time.
• Potential users of a data mart are
clearly defined and can be targeted for
support
• To provide appropriately structured data as
dictated by the requirements of the end-user
access tools.
• Building a data mart is simpler compared with
establishing a corporate data warehouse.
• The cost of implementing data marts is far
less than that required to establish a data
warehouse.
Data Marts Issues
• Data mart functionality
• Data mart size
• Data mart load performance
• Users’ access to data in multiple data marts
• Data mart Internet / Intranet access
• Data mart administration
• Data mart installation
Example of a DW tool: OLAP
• Rotate and drill down to successive
levels of detail.
• Create and examine calculated data
interactively on large volumes of data.
• Determine comparative or relative
differences.
• Perform exception and trend analysis.
• Perform advanced analytical functions,
for example forecasting, modeling, and
regression analysis
Original OLAP Rules
1. Multidimensional conceptual view
2. Transparency
3. Accessibility
4. Consistent reporting performance
5. Client-server architecture
6. Multiuser support
7. Unrestricted cross-dimensional
operations
8. Intuitive data manipulation
9. Flexible reporting
10. Unlimited dimensions and
aggregation levels
Relational Database Model
         Attribute 1   Attribute 2   Attribute 3   Attribute 4
         Name          Age           Gender        Emp No.
Row 1    Anderson      31            F             1001
Row 2    Green         42            M             1007
Row 3    Lee           22            M             1010
Row 4    Ramos         32            F             1020
The table above illustrates the employee relation.
Multidimensional Database Model
[Figure: two-dimensional and three-dimensional views of the SALES and FINANCE data; dimension labels: Customer, Store, Product, Time, GL_Line]
The data is found at the intersection of dimensions.
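The idea that data sits at the intersection of dimensions can be sketched in a few lines of Python. This is only an illustration, not how a multidimensional database stores data internally: the dimension names (product, store, month), the `cell` and `roll_up` helpers, and all figures are made up for the example.

```python
from collections import defaultdict

# Hypothetical sales cells: each value sits at the intersection of three
# dimension members (product, store, month). All figures are made up.
DIMENSIONS = ("product", "store", "month")
cube = {
    ("PC", "Store 1", "Jan"): 120,
    ("PC", "Store 2", "Jan"): 80,
    ("Server", "Store 1", "Jan"): 15,
    ("PC", "Store 1", "Feb"): 130,
}

def cell(product, store, month):
    """Return the measure at one intersection (0 if the cell is empty)."""
    return cube.get((product, store, month), 0)

def roll_up(keep):
    """Aggregate the cube down to the dimensions named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)

print(cell("PC", "Store 1", "Jan"))
print(roll_up(["product", "month"]))
```

Dropping a dimension from `keep` gives the coarser view; keeping all three returns the original cells.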
Specialised Multidimensional tool
• Benefits:
– Quick access to very large volumes of data
– Extensive and comprehensive libraries of
complex functions for analysis
– Strong modeling and forecasting capabilities
– Can access multidimensional and relational
database structures
– Caters for calculated fields
• Disadvantages:
– Difficulty of changing model
– Lack of support for very large volumes of data
– May require significant processing power
MOLAP Server
• The application layer stores data in a multidimensional structure
• The presentation layer provides the multidimensional view
• Efficient storage and processing
• Complexity hidden from the user
• Analysis using preaggregated summaries and precalculated measures
[Figure: DSS client → MOLAP engine (application layer) → warehouse]
ROLAP Server
• The warehouse stores atomic data.
• The application layer generates SQL for the three-dimensional view.
• The presentation layer provides the multidimensional view.
[Figure: DSS client → ROLAP engine (application layer, multiple SQL) → warehouse server]
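A minimal sketch of the ROLAP idea, assuming an atomic `sales_facts` table with hypothetical `region` and `month` columns: the application layer turns a requested set of dimensions into a GROUP BY statement and runs it against the relational store. SQLite stands in for the warehouse server here, and `build_rollup_sql` is an illustrative helper, not part of any ROLAP product.

```python
import sqlite3

# Hypothetical dimension columns of an atomic sales_facts table.
ALLOWED_DIMENSIONS = ("region", "month")

def build_rollup_sql(dimensions):
    """Turn a requested list of dimensions into one GROUP BY query."""
    if not dimensions:
        raise ValueError("at least one dimension is required")
    unknown = [d for d in dimensions if d not in ALLOWED_DIMENSIONS]
    if unknown:
        # Whitelisting keeps arbitrary text out of the generated SQL.
        raise ValueError(f"unknown dimensions: {unknown}")
    cols = ", ".join(dimensions)
    return (f"SELECT {cols}, SUM(sales) AS total_sales "
            f"FROM sales_facts GROUP BY {cols} ORDER BY {cols}")

# In-memory SQLite database standing in for the warehouse (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_facts (sales INTEGER, region TEXT, month TEXT)")
conn.executemany(
    "INSERT INTO sales_facts VALUES (?, ?, ?)",
    [(100, "North", "Jan"), (50, "North", "Feb"), (70, "South", "Jan")],
)

rows = conn.execute(build_rollup_sql(["region"])).fetchall()
print(rows)
```

Because the SQL is generated at query time, the warehouse only needs to hold the atomic rows; the cost is that every view is computed on demand.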
MOLAP
[Figure: warehouse data → periodic load → MDDB → query → Express Server → Express user]
ROLAP
[Figure: warehouse data → live fetch → data cache → query → Express Server → Express user]
Also Hybrid (HOLAP)
Choosing a Reporting Architecture
• Business needs
• Potential for growth
• Interface
• Enterprise architecture
• Network architecture
• Speed of access
• Openness
[Figure: MOLAP and ROLAP plotted by query performance (OK to good) against analysis (simple to complex)]
Data Acquisition
• Identify, extract, transform, and transport
source data
• Consider internal and external data
• Perform gap analysis between source data
and target database objects
• Plan move of data between sources and target
• Define first-time load and refresh strategy
• Define tool requirements
• Build, test, and execute data acquisition
modules
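The difference between a first-time load and a refresh can be sketched as follows. This is one possible strategy among several, not the method the slides prescribe: it assumes every source row carries a hypothetical `updated_at` counter, and the refresh copies only rows changed since the last recorded high-water mark.

```python
def first_time_load(source_rows):
    """Copy every source row and remember the highest change counter seen."""
    warehouse = {row["id"]: dict(row) for row in source_rows}
    watermark = max((row["updated_at"] for row in source_rows), default=0)
    return warehouse, watermark

def refresh(warehouse, source_rows, watermark):
    """Copy only rows changed since `watermark`; return (count, new watermark)."""
    changed = [row for row in source_rows if row["updated_at"] > watermark]
    for row in changed:
        warehouse[row["id"]] = dict(row)
    new_watermark = max((row["updated_at"] for row in changed), default=watermark)
    return len(changed), new_watermark

# Made-up operational rows for illustration.
source = [
    {"id": 1, "name": "Anderson", "updated_at": 1},
    {"id": 2, "name": "Green", "updated_at": 2},
]
warehouse, mark = first_time_load(source)

# Later: one row is updated and one is added in the source system.
source[0] = {"id": 1, "name": "Anderson-Lee", "updated_at": 3}
source.append({"id": 3, "name": "Ramos", "updated_at": 4})
copied, mark = refresh(warehouse, source, mark)
print(copied, mark, sorted(warehouse))
```

A change counter or timestamp is only one way to detect changes; deletes in the source, for example, are not picked up by this sketch.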
Modeling
• Warehouses differ from operational
structures:
– Analytical requirements
– Subject orientation
• Data must map to subject oriented
information:
– Identify business subjects
– Define relationships between subjects
– Name the attributes of each subject
• Modeling is iterative
• Modeling tools are available
Modeling the Data Warehouse
1. Defining the business model
2. Creating the dimensional model
3. Modeling summaries
4. Creating the physical model
[Figure: phases 1 to 4, from "Select a business process" to "Physical model"]
Identifying Business Rules
Location
– Geographic proximity: 0 - 1 miles, 1 - 5 miles, > 5 miles
Time
– Month > Quarter > Year
Product
– Type: PC, Server
– Monitor: 15 inch, 17 inch, 19 inch, None
– Status: New, Rebuilt, Custom
Store
– Store > District > Region
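Business rules like these eventually become attribute values in the dimension tables. The sketch below shows two of them as plain functions, under stated assumptions: the slides do not say which band a boundary distance falls in, so the upper bound is treated as inclusive, and the label formats for month, quarter, and year are invented for the example.

```python
def proximity_band(miles):
    """Map a distance to one of the three geographic proximity bands.

    Assumption: upper bounds are inclusive (exactly 1 mile is "0 - 1 miles").
    """
    if miles < 0:
        raise ValueError("distance cannot be negative")
    if miles <= 1:
        return "0 - 1 miles"
    if miles <= 5:
        return "1 - 5 miles"
    return "> 5 miles"

def time_rollup(year, month):
    """Return the Month > Quarter > Year levels for one calendar month."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    quarter = (month - 1) // 3 + 1
    return {
        "month": f"{year}-{month:02d}",
        "quarter": f"{year}-Q{quarter}",
        "year": str(year),
    }

print(proximity_band(3.2))
print(time_rollup(1999, 5))
```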
Creating the Dimensional Model
• Identify fact tables
– Translate business measures into fact tables
– Analyze source system information for additional measures
– Identify base and derived measures
– Document additivity of measures
• Identify dimension tables
• Link fact tables to the dimension tables
• Create views for users
Dimension Tables
Dimension tables have the following
characteristics:
• Contain textual information that
represents the attributes of the business
• Contain relatively static data
• Are joined to a fact table through a
foreign key reference
[Figure: Facts table (units, price) joined to the Product, Channel, Customer, and Time dimension tables]
Fact Tables
Fact tables have the following characteristics:
• Contain numeric measures (metrics) of the
business
• May contain summarized (aggregated) data
• May contain date-stamped data
• Are typically additive
• Have a key that is typically a concatenated
key composed of the primary keys of the
dimensions
• Are joined to dimension tables through foreign
keys that reference primary keys in the
dimension tables
Dimensional Model (Star Schema)
[Figure: central fact table, Facts (units, price), surrounded by the dimension tables Product, Channel, Customer, and Time]
Star Schema Model
• Central fact table
• Radiating dimensions
• Denormalized model
Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_dollars, Sales_units, ...
Product Table: Product_id, Product_desc, …
Time Table: Day_id, Month_id, Period_id, Year_id
Store Table: Store_id, District_id, ...
Item Table: Item_id, Item_desc, ...
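The star schema above can be tried out with an in-memory SQLite database. This is a sketch under assumptions rather than the Oracle design from the slides: the column names follow the slide, but the column types, the `time_dim` table name, and every data row are made up. The fact table uses a concatenated primary key built from the dimension keys, and the query joins the fact table directly to two dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE item     (item_id INTEGER PRIMARY KEY, item_desc TEXT);
CREATE TABLE store    (store_id INTEGER PRIMARY KEY, district_id INTEGER);
CREATE TABLE time_dim (day_id INTEGER PRIMARY KEY, month_id INTEGER,
                       period_id INTEGER, year_id INTEGER);
CREATE TABLE sales_fact (
    product_id    INTEGER REFERENCES product(product_id),
    store_id      INTEGER REFERENCES store(store_id),
    item_id       INTEGER REFERENCES item(item_id),
    day_id        INTEGER REFERENCES time_dim(day_id),
    sales_dollars REAL,
    sales_units   INTEGER,
    PRIMARY KEY (product_id, store_id, item_id, day_id)  -- concatenated key
);
""")

# Made-up rows for illustration only.
conn.execute("INSERT INTO product VALUES (1, 'Widget')")
conn.execute("INSERT INTO item VALUES (1, 'Widget, blue')")
conn.executemany("INSERT INTO store VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)",
                 [(19990101, 199901, 1, 1999), (19990102, 199901, 1, 1999)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 19990101, 100.0, 10),
    (1, 2, 1, 19990101, 50.0, 5),
    (1, 3, 1, 19990102, 75.0, 3),
    (1, 1, 1, 19990102, 25.0, 2),
])

# One join per dimension: sales by district and year.
sales_by_district_year = conn.execute("""
    SELECT s.district_id, t.year_id,
           SUM(f.sales_dollars), SUM(f.sales_units)
    FROM sales_fact f
    JOIN store s    ON s.store_id = f.store_id
    JOIN time_dim t ON t.day_id = f.day_id
    GROUP BY s.district_id, t.year_id
    ORDER BY s.district_id, t.year_id
""").fetchall()
print(sales_by_district_year)
```

Each dimension is one join away from the fact table, which is what keeps star-schema queries short.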
Star Schema Model
• Easy for users to understand
• Fast response to queries
• Simple metadata
• Supported by many front end tools
• Less robust to change
• Slower to build
• Does not support history
Snowflake Schema Model
Sales Fact Table: Item_id, Store_id, Sales_dollars, Sales_units
Product Table: Product_id, Product_desc
Store Table: Store_id, Store_desc, District_id
District Table: District_id, District_desc
Time Table: Week_id, Period_id, Year_id
Item Table: Item_id, Item_desc, Dept_id
Dept Table: Dept_id, Dept_desc, Mgr_id
Mgr Table: Dept_id, Mgr_id, Mgr_name
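In a snowflake schema the dimensions are normalized, so reaching a higher-level attribute takes an extra join. The sketch below illustrates this with the Store and District tables, using SQLite as a stand-in; column types and all data rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE district (district_id INTEGER PRIMARY KEY, district_desc TEXT);
CREATE TABLE store (
    store_id    INTEGER PRIMARY KEY,
    store_desc  TEXT,
    district_id INTEGER REFERENCES district(district_id)
);
CREATE TABLE sales_fact (
    item_id       INTEGER,
    store_id      INTEGER REFERENCES store(store_id),
    sales_dollars REAL,
    sales_units   INTEGER
);
""")

# Made-up rows for illustration only.
conn.executemany("INSERT INTO district VALUES (?, ?)",
                 [(10, "Coastal"), (20, "Inland")])
conn.executemany("INSERT INTO store VALUES (?, ?, ?)",
                 [(1, "Harbor", 10), (2, "Pier", 10), (3, "Hilltop", 20)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 100.0, 10), (1, 2, 50.0, 5), (1, 3, 75.0, 3)])

# Fact -> store -> district: two joins to reach the district description.
sales_by_district = conn.execute("""
    SELECT d.district_desc, SUM(f.sales_dollars)
    FROM sales_fact f
    JOIN store s    ON s.store_id = f.store_id
    JOIN district d ON d.district_id = s.district_id
    GROUP BY d.district_desc
    ORDER BY d.district_desc
""").fetchall()
print(sales_by_district)
```

The district description is stored once per district rather than once per store, at the cost of a longer join path for every query that needs it.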
Snowflake Schema Model
• Direct use by some tools
• More flexible to change
• Provides for speedier data loading
• May become large and unmanageable
• Degrades query performance
• More complex metadata
Using Summary Data
Phase 3: Modeling summaries
• Provides fast access to precomputed
data
• Reduces use of I/O, CPU, and memory
• Is distilled from source systems and
precalculated summaries
• Usually exists in summary fact tables
Designing Summary Tables
• Average
• Maximum
• Total
• Percentage
[Figure: summary table of Units and Sales(€) by Store, with a Total row for each of Product A, Product B, and Product C]
Summary Tables Example
SALES FACTS
Sales     Region   Month
10,000    North    Jan 99
12,000    South    Feb 99
11,000    North    Jan 99
15,000    West     Mar 99
18,000    South    Feb 99
20,000    North    Jan 99
10,000    East     Jan 99
2,000     West     Mar 99
SALES BY MONTH/REGION
Month     Region   Tot_Sales$
Jan 99    North    41,000
Jan 99    East     10,000
Feb 99    South    30,000
Mar 99    West     17,000
SALES BY MONTH
Month     Tot_Sales
Jan 99    51,000
Feb 99    30,000
Mar 99    17,000
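The two summary tables can be derived mechanically from the eight fact rows. The sketch below does this in plain Python; the `summarize` helper is illustrative only. Note that the two South rows for Feb 99 as listed (12,000 and 18,000) sum to 30,000.

```python
from collections import defaultdict

# The eight rows of the SALES FACTS table: (sales, region, month).
SALES_FACTS = [
    (10_000, "North", "Jan 99"),
    (12_000, "South", "Feb 99"),
    (11_000, "North", "Jan 99"),
    (15_000, "West", "Mar 99"),
    (18_000, "South", "Feb 99"),
    (20_000, "North", "Jan 99"),
    (10_000, "East", "Jan 99"),
    (2_000, "West", "Mar 99"),
]

def summarize(facts, key):
    """Total the sales measure per group; `key` picks the grouping columns."""
    totals = defaultdict(int)
    for sales, region, month in facts:
        totals[key(region, month)] += sales
    return dict(totals)

by_month_region = summarize(SALES_FACTS, lambda region, month: (month, region))
by_month = summarize(SALES_FACTS, lambda region, month: month)

print(by_month_region)
print(by_month)
```

The monthly summary could equally be computed from the month/region summary instead of the fact rows, which is the usual reason for keeping several levels of summary.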
Summary Management in Oracle8i
[Figure: a sales summary built from the Sales fact and its dimensions (Region, State, City; Product; Time); the Summary advisor reports summary usage, summary recommendations, and space requirements]
The Time Dimension
• Time is critical to the data warehouse.
• A consistent representation of time is
required for extensibility.
[Figure: Sales fact table linked to the Time dimension]
How and where should it be stored?
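One common answer is to store time as its own dimension table with one row per day, generated once and shared by every fact table. The sketch below builds such rows in Python. The key formats (for example 19990101 for a day) and the `quarter_id` and `day_of_week` columns are assumptions for illustration, not the layout used in the slides.

```python
from datetime import date, timedelta

def build_time_dimension(start, end):
    """Return one row per calendar day from `start` to `end` inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "day_id": day.year * 10000 + day.month * 100 + day.day,
            "month_id": day.year * 100 + day.month,
            "quarter_id": day.year * 10 + (day.month - 1) // 3 + 1,
            "year_id": day.year,
            "day_of_week": day.isoweekday(),  # 1 = Monday ... 7 = Sunday
        })
        day += timedelta(days=1)
    return rows

time_rows = build_time_dimension(date(1999, 1, 1), date(1999, 12, 31))
print(len(time_rows), time_rows[0])
```

Generating the rows from the calendar, rather than from the dates that happen to appear in the facts, keeps the representation consistent even for days with no sales.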