Designing the data warehouse / data marts

DESIGNING THE DATA WAREHOUSE / DATA MARTS
Basic principles
WAREHOUSE COMPONENTS
Any Source: operational data, external data
Any Data: relational / multidimensional; text, image; spatial; Web; audio, video
Any Access: relational tools, OLAP tools, applications / Web
INTELLIGENCE TOOLS
IS develops users' views
Business users: current, tactical (Reports)
Analysts: strategic (Discoverer)
DATA MART SUITE
Data modeling: Data Mart Designer
Data extraction: Data Mart Builder
Data management: Enterprise Manager
Data access & analysis: Discoverer & Reports
Engines and databases: OLTP engines (OLTP databases), warehousing engines (data mart database)
DATA MART
A collection of subject areas organized for decision support based on the needs of a given department.
Characteristics include:
• A star-join structure that is optimal for the needs of the users found in the department.
• A dependent data mart is one whose source is a data warehouse.
• An independent data mart is one whose source is the legacy applications environment.
DATA WAREHOUSES VERSUS DATA MARTS
Property              Data Warehouse      Data Mart
Scope                 Enterprise          Department
Subject               Multiple            Single-subject
Data source           Many                Few
Size (typical)        100 GB to > 1 TB    < 100 GB
Implementation time   Months to years     Months
DEPENDENT DATA MART
Sources: operational systems, flat files, external data
Data warehouse: Marketing, Sales, Finance, Human Resources
Data marts (fed from the data warehouse): Marketing, Sales, Finance
INDEPENDENT DATA MART
Sources: operational systems, flat files, external data
Data mart (fed directly from the sources): Sales or Marketing
REASONS FOR CREATING A DATA MART
• To give users more flexible access to the data they need to analyse most often.
• To provide data in a form that matches the collective view of a group of users.
• To improve end-user response time.
• Potential users of a data mart are clearly defined and can be targeted for support.
• Building a data mart is simpler compared with establishing a corporate data warehouse.
• The cost of implementing data marts is far less than that required to establish a data warehouse.
EXAMPLE OF A DW TOOL: OLAP
• Rotate and drill down to successive levels of detail.
• Create and examine calculated data interactively on large volumes of data.
• Determine comparative or relative differences.
• Perform exception and trend analysis.
• Perform advanced analytical functions, for example forecasting, modeling, and regression analysis.
ORIGINAL OLAP RULES
1. Multidimensional conceptual view
2. Transparency
3. Accessibility
4. Consistent reporting performance
5. Client-server architecture
6. Multi-user support
7. Unrestricted cross-dimensional operations
8. Intuitive data manipulation
9. Flexible reporting
10. Unlimited dimensions and aggregation levels
RELATIONAL DATABASE MODEL
         Attribute 1   Attribute 2   Attribute 3   Attribute 4
         Name          Age           Gender        Emp No.
Row 1    Anderson      31            F             1001
Row 2    Green         42            M             1007
Row 3    Lee           22            M             1010
Row 4    Ramos         32            F             1020
The table above illustrates the employee relation.
MULTIDIMENSIONAL DATABASE MODEL
Dimensions: Store, Time, Product. Measure: SALES.
Illustrated in two dimensions and in three dimensions.
The data is found at the intersection of dimensions.
ROLAP SERVER
• The warehouse stores atomic data.
• The application layer generates SQL for the three-dimensional view.
• The presentation layer provides the multidimensional view.
Diagram: DSS client (Express user); application layer with ROLAP engine and data cache (Express Server); warehouse server (warehouse data). Labeled flows: query, multiple SQL, live fetch, ROLAP cache.
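The ROLAP flow described above (atomic rows in the warehouse, SQL generated by the application layer for each requested view) can be sketched roughly as follows. This is only an illustration: Python's built-in sqlite3 module stands in for the warehouse server, and the table `sales`, its columns, the sample rows, and the helper `build_rollup_sql` are all hypothetical names, not part of any ROLAP product.

```python
import sqlite3

ALLOWED_DIMENSIONS = ("store", "product", "month")  # hypothetical dimension columns


def build_rollup_sql(dimensions):
    """Generate a GROUP BY query for the requested dimensions (illustrative only)."""
    if not dimensions:
        raise ValueError("at least one dimension is required")
    for dim in dimensions:
        if dim not in ALLOWED_DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim}")
    cols = ", ".join(dimensions)
    return f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"


def run_view(conn, dimensions):
    """Execute the generated SQL and return the aggregated rows."""
    return conn.execute(build_rollup_sql(dimensions)).fetchall()


# Stand-in "warehouse" holding atomic data: one row per sale (invented values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, product TEXT, month TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("S1", "PC", "Jan", 100),
        ("S1", "PC", "Feb", 150),
        ("S1", "Server", "Jan", 300),
        ("S2", "PC", "Jan", 200),
    ],
)

by_store = run_view(conn, ["store"])
by_store_product = run_view(conn, ["store", "product"])
print(by_store)
print(by_store_product)
```

Each view is computed on demand from the atomic rows, which is why a ROLAP engine typically pairs this approach with a cache.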
MOLAP SERVER
• The application layer stores data in a multidimensional structure.
• The presentation layer provides the multidimensional view.
• Efficient storage and processing
• Complexity hidden from the user
• Analysis using preaggregated summaries and precalculated measures
Diagram: DSS client (Express user); application layer with MOLAP engine and MDDB (Express Server); warehouse (warehouse data). Labeled flows: query, periodic load.
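To make the idea of a periodically loaded, preaggregated multidimensional store more concrete, here is a minimal sketch. It assumes a plain Python dictionary as the "MDDB", keyed by (store, product, month) coordinates; the marker `ALL`, the function `load_cube`, and the sample facts are invented for illustration and do not reflect how any particular MOLAP server stores data.

```python
from itertools import product

ALL = "*"  # illustrative marker meaning "summed over this dimension"

# Atomic facts as (store, product, month, amount); values are invented.
facts = [
    ("S1", "PC", "Jan", 100),
    ("S1", "PC", "Feb", 150),
    ("S1", "Server", "Jan", 300),
    ("S2", "PC", "Jan", 200),
]


def load_cube(rows):
    """Periodic load: precompute every aggregate cell of the cube."""
    cube = {}
    for store, prod, month, amount in rows:
        # Each fact contributes to 2**3 cells: for every dimension we either
        # keep the member or collapse it to ALL.
        for key in product((store, ALL), (prod, ALL), (month, ALL)):
            cube[key] = cube.get(key, 0) + amount
    return cube


cube = load_cube(facts)

# Queries are now dictionary lookups; no aggregation happens at query time.
print(cube[("S1", ALL, ALL)])    # all sales for store S1
print(cube[(ALL, "PC", "Jan")])  # PC sales in Jan across stores
print(cube[(ALL, ALL, ALL)])     # grand total
```

The trade-off is visible even at this scale: lookups are cheap, but the number of stored cells grows quickly with the number of dimensions and members.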
CHOOSING A REPORTING ARCHITECTURE
• Business needs
• Potential for growth
• Interface
• Enterprise architecture
• Network architecture
• Speed of access
• Openness
Chart: query performance (OK to Good) plotted against analysis (Simple to Complex) for MOLAP and ROLAP.
DATA ACQUISITION
• Identify, extract, transform, and transport source data
• Consider internal and external data
• Perform gap analysis between source data and target database objects
• Plan move of data between sources and target
• Define first-time load and refresh strategy
• Define tool requirements
• Build, test, and execute data acquisition modules
MODELING
• Warehouses differ from operational structures:
  - Analytical requirements
  - Subject orientation
• Data must map to subject-oriented information:
  - Identify business subjects
  - Define relationships between subjects
  - Name the attributes of each subject
• Modeling is iterative
• Modeling tools are available
MODELING THE DATA WAREHOUSE
1. Defining the business model (select a business process)
2. Creating the dimensional model
3. Modeling summaries
4. Creating the physical model
FROM TABLES AND SPREADSHEETS TO DATA CUBES
• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions:
  - Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
  - A fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
CUBE: A LATTICE OF CUBOIDS
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
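The lattice above can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions, so n dimensions give 2^n cuboids when no dimension has a hierarchy. A minimal sketch, where the function name `cuboid_lattice` is an illustrative choice:

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Return {k: list of k-dimensional cuboids} over all subsets of the dimensions."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    # The empty combination is the apex cuboid, labeled "all".
    print(f"{k}-D:", [",".join(c) or "all" for c in cuboids])

total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total_cuboids)
```

When dimensions carry concept hierarchies (for example day, month, quarter, year), the number of possible cuboids is larger than 2^n, because each level of each hierarchy can be chosen independently.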
IDENTIFYING BUSINESS RULES
Location: geographic proximity (0 - 1 miles, 1 - 5 miles, > 5 miles)
Time: Month > Quarter > Year
Product:
  Type: PC, Server
  Monitor: 15 inch, 17 inch, 19 inch, None
  Status: New, Rebuilt, Custom
Store: Store > District > Region
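Business rules like these usually end up as small mapping functions or lookup tables that assign each detailed value to its group. The sketch below is one possible encoding. The boundary handling for the distance bands (exactly 1 mile and exactly 5 miles fall in the lower band) and the store, district, and region names are assumptions made for illustration.

```python
def quarter_of(month):
    """Month > Quarter: map a month number (1-12) to its quarter (1-4)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return (month - 1) // 3 + 1


def proximity_band(miles):
    """Bucket a distance into the three geographic proximity bands.

    Assumption: boundary values (1 and 5 miles) belong to the lower band.
    """
    if miles < 0:
        raise ValueError("distance cannot be negative")
    if miles <= 1:
        return "0 - 1 miles"
    if miles <= 5:
        return "1 - 5 miles"
    return "> 5 miles"


# Store > District > Region as lookup tables (hypothetical members).
STORE_TO_DISTRICT = {"Store 1": "District A", "Store 2": "District A", "Store 3": "District B"}
DISTRICT_TO_REGION = {"District A": "Region North", "District B": "Region South"}


def region_of(store):
    """Climb the store hierarchy two levels: store -> district -> region."""
    return DISTRICT_TO_REGION[STORE_TO_DISTRICT[store]]


print(quarter_of(5), proximity_band(3.2), region_of("Store 3"))
```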
CREATING THE DIMENSIONAL MODEL
• Identify fact tables
  - Translate business measures into fact tables
  - Analyze source system information for additional measures
  - Identify base and derived measures
  - Document additivity of measures
• Identify dimension tables
• Link fact tables to the dimension tables
• Create views for users
DIMENSION TABLES
Dimension tables have the following characteristics:
• Contain textual information that represents the attributes of the business
• Contain relatively static data
• Are joined to a fact table through a foreign key reference
Diagram: Facts (units, price) at the center, joined to the Product, Channel, Customer, and Time dimensions.
A CONCEPT HIERARCHY: DIMENSION (LOCATION)
all:      all
region:   Europe, ..., North_America
country:  Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
city:     Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
office:   L. Chan, ..., M. Wind (Vancouver)
Data Mining: Concepts and Techniques
FACT TABLES
Fact tables have the following characteristics:
• Contain numeric measures (metrics) of the business
• May contain summarized (aggregated) data
• May contain date-stamped data
• Have a key value that is typically a concatenated key composed of the primary keys of the dimensions
• Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables
Diagram: Facts (units, price) at the center, joined to the Product, Channel, Customer, and Time dimensions.
CONCEPTUAL MODELING OF DATA WAREHOUSES
• Modeling data warehouses: dimensions & measures
  - Star schema: a fact table in the middle connected to a set of dimension tables
  - Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  - Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
DIMENSIONAL MODEL (STAR SCHEMA)
Fact table: Facts (units, price)
Dimension tables: Product, Channel, Customer, Time
STAR SCHEMA MODEL
• Central fact table
• Radiating dimensions
• Denormalized model
Product Table: Product_id, Product_desc, …
Time Table: Day_id, Month_id, Period_id, Year_id
Store Table: Store_id, District_id, ...
Item Table: Item_id, Item_desc, ...
Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_dollars, Sales_units, ...
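A star schema like the one above is typically queried by joining the fact table to the dimensions of interest and grouping on dimension attributes. The sketch below builds a reduced version of the schema (store, item, and time dimensions only) in an in-memory SQLite database. The lowercase table names, the column types, and all sample rows are assumptions for illustration; only the column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store    (store_id INTEGER PRIMARY KEY, district_id INTEGER);
CREATE TABLE item     (item_id INTEGER PRIMARY KEY, item_desc TEXT);
CREATE TABLE time_dim (day_id INTEGER PRIMARY KEY, month_id INTEGER, year_id INTEGER);
-- The fact table key is the concatenation of the dimension keys.
CREATE TABLE sales_fact (
    store_id      INTEGER REFERENCES store(store_id),
    item_id       INTEGER REFERENCES item(item_id),
    day_id        INTEGER REFERENCES time_dim(day_id),
    sales_dollars REAL,
    sales_units   INTEGER,
    PRIMARY KEY (store_id, item_id, day_id)
);
""")

# Invented sample rows.
conn.executemany("INSERT INTO store VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])
conn.executemany("INSERT INTO item VALUES (?, ?)", [(100, "Monitor"), (200, "Server")])
conn.executemany(
    "INSERT INTO time_dim VALUES (?, ?, ?)",
    [(1, 1, 1999), (2, 1, 1999), (32, 2, 1999)],
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
    [
        (1, 100, 1, 500.0, 2),
        (2, 100, 2, 250.0, 1),
        (3, 200, 1, 4000.0, 1),
        (1, 200, 32, 8000.0, 2),
    ],
)

# Star join: fact table joined to two dimensions, grouped by dimension attributes.
rows = conn.execute("""
    SELECT s.district_id, t.month_id, SUM(f.sales_dollars), SUM(f.sales_units)
    FROM sales_fact AS f
    JOIN store    AS s ON s.store_id = f.store_id
    JOIN time_dim AS t ON t.day_id   = f.day_id
    GROUP BY s.district_id, t.month_id
    ORDER BY s.district_id, t.month_id
""").fetchall()
for row in rows:
    print(row)
```

Every dimension is one join away from the fact table, which is what keeps star-schema queries simple for both users and front-end tools.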
STAR SCHEMA MODEL
• Easy for users to understand
• Fast response to queries
• Simple metadata
• Supported by many front end tools
• Less robust to change
SNOWFLAKE SCHEMA MODEL
• Direct use by some tools
• More flexible to change
• Provides for speedier data loading
• May become large and unmanageable
• Degrades query performance
• More complex metadata
SNOWFLAKE SCHEMA MODEL
Product Table: Product_id, Product_desc
Store Table: Store_id, Store_desc, District_id
District Table: District_id, District_desc
Time Table: Week_id, Period_id, Year_id
Item Table: Item_id, Item_desc, Dept_id
Dept Table: Dept_id, Dept_desc, Mgr_id
Mgr Table: Dept_id, Mgr_id, Mgr_name
Sales Fact Table: Item_id, Store_id, Sales_dollars, Sales_units
EXAMPLE OF FACT CONSTELLATION
Dimension tables:
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper: shipper_key, shipper_name, location_key, shipper_type
Sales Fact Table:
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table:
  Keys: time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped
USING SUMMARY DATA
• Provides fast access to precomputed data
• Reduces use of I/O, CPU, and memory
• Usually exists in summary fact tables
Average?
MEASURES: THREE CATEGORIES
• distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
  - E.g., count(), sum(), min(), max().
• algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
  - E.g., avg(), min_N(), standard_deviation().
• holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
  - E.g., median(), mode(), rank().
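The three categories can be checked on a small partitioned data set. The sketch below uses invented numbers and three arbitrary partitions: it combines partial sums (distributive), derives the average from partial sums and counts (algebraic), and shows that combining partial medians does not, in general, give the overall median (holistic).

```python
from statistics import median

# Invented data, split into three partitions.
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]
all_values = [v for part in partitions for v in part]

# Distributive: sum() over the partial sums equals sum() over all the data.
partial_sums = [sum(part) for part in partitions]
sum_from_parts = sum(partial_sums)
sum_direct = sum(all_values)

# Algebraic: avg() is computed from two distributive aggregates, sum and count.
partial_counts = [len(part) for part in partitions]
avg_from_parts = sum(partial_sums) / sum(partial_counts)
avg_direct = sum(all_values) / len(all_values)

# Holistic: a median of partition medians is not the overall median here,
# so the full data (not a bounded summary) is needed for an exact answer.
median_of_medians = median([median(part) for part in partitions])
overall_median = median(all_values)

print(sum_from_parts == sum_direct)
print(avg_from_parts == avg_direct)
print(median_of_medians, overall_median)
```

This distinction is what decides which measures can be stored in summary tables and safely re-aggregated to coarser levels.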
DESIGNING SUMMARY TABLES
• Average
• Total
• Maximum
• Percentage
Example layout: Units and Sales (€) totals for Product A, Product B, and Product C, by Store.
SUMMARY TABLES EXAMPLE
SALES FACTS
Sales    Region  Month
10,000   North   Jan 99
12,000   South   Feb 99
11,000   North   Jan 99
15,000   West    Mar 99
18,000   South   Feb 99
20,000   North   Jan 99
10,000   East    Jan 99
 2,000   West    Mar 99
SALES BY MONTH/REGION
Month    Region  Tot_Sales$
Jan 99   North   41,000
Jan 99   East    10,000
Feb 99   South   30,000
Mar 99   West    17,000
SALES BY MONTH
Month    Tot_Sales
Jan 99   51,000
Feb 99   30,000
Mar 99   17,000
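Summary tables of this kind can be derived mechanically from the fact rows. The sketch below loads the eight SALES FACTS rows listed above and builds both summaries. The coarser month-level summary is computed from the month/region summary rather than from the facts, which works because sum is a distributive measure. The variable and function names are illustrative.

```python
from collections import defaultdict

# (sales, region, month) rows copied from the SALES FACTS table.
sales_facts = [
    (10_000, "North", "Jan 99"),
    (12_000, "South", "Feb 99"),
    (11_000, "North", "Jan 99"),
    (15_000, "West", "Mar 99"),
    (18_000, "South", "Feb 99"),
    (20_000, "North", "Jan 99"),
    (10_000, "East", "Jan 99"),
    (2_000, "West", "Mar 99"),
]


def summarize_by_month_region(rows):
    """Total sales per (month, region) pair."""
    totals = defaultdict(int)
    for sales, region, month in rows:
        totals[(month, region)] += sales
    return dict(totals)


def summarize_by_month(month_region_totals):
    """Roll the month/region summary up to month level, without rereading the facts."""
    totals = defaultdict(int)
    for (month, _region), total in month_region_totals.items():
        totals[month] += total
    return dict(totals)


by_month_region = summarize_by_month_region(sales_facts)
by_month = summarize_by_month(by_month_region)

for (month, region), total in by_month_region.items():
    print(f"{month}  {region:<6} {total:>7,}")
for month, total in by_month.items():
    print(f"{month}  {total:>7,}")
```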
SUMMARY MANAGEMENT
Sales summary: Sales by Region / State / City, Product, and Time
Summary advisor: summary usage, summary recommendations, space requirements
THE TIME DIMENSION
• Time is critical to the data warehouse.
• A consistent representation of time is required for extensibility.
• Sales fact and time dimension: how and where should time be stored?
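One common answer is to store time as its own dimension table with one row per day, so that every fact row carries only a day key. The sketch below generates such rows with the standard library. The column names, the `YYYYMMDD` integer key, and the chosen date range are assumptions for illustration, not something the slide prescribes.

```python
from datetime import date, timedelta


def build_time_dimension(start, end):
    """Return one dictionary per calendar day from start to end (inclusive)."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "day_id": int(day.strftime("%Y%m%d")),  # e.g. 19990101
            "date": day.isoformat(),
            "iso_weekday": day.isoweekday(),         # 1 = Monday ... 7 = Sunday
            # ISO week numbers can belong to the adjacent year near 1 January.
            "iso_week": day.isocalendar()[1],
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
        })
        day += timedelta(days=1)
    return rows


time_dim = build_time_dimension(date(1999, 1, 1), date(1999, 3, 31))
print(len(time_dim), "rows")
print(time_dim[0])
```

Because every attribute is precomputed once, queries can group by month, quarter, or year with a simple join instead of repeating date arithmetic.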
MULTIDIMENSIONAL DATA
• Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths:
  Product: Industry, Category, Product
  Location: Region, Country, City, Office
  Time: Year, Quarter, Month or Week, Day
A SAMPLE DATA CUBE
Axes: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), Country (U.S.A, Canada, Mexico, sum)
Highlighted cell: total annual sales of TV in U.S.A.
CUBOIDS CORRESPONDING TO THE CUBE
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product,date; product,country; date,country
3-D (base) cuboid: product, date, country
September 22, 2012
BROWSING A DATA CUBE
• Visualization
• OLAP capabilities
• Interactive manipulation


TYPICAL OLAP OPERATIONS
• Roll up (drill-up): summarize data
  - by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  - from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice:
  - project and select
• Pivot (rotate):
  - reorient the cube, visualization, 3D to series of 2D planes
• Other operations
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
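The operations above can be illustrated on a tiny cube held in a dictionary. This is a minimal sketch with invented sales figures: the cell layout (a tuple of product, quarter, country) and the function names are assumptions for illustration, and a real OLAP server would work on far larger, indexed structures.

```python
from collections import defaultdict

DIMS = ("product", "quarter", "country")

# Cells of a small cube: (product, quarter, country) -> sales (invented values).
cube = {
    ("TV", "1Qtr", "U.S.A"): 10,
    ("TV", "2Qtr", "U.S.A"): 20,
    ("TV", "1Qtr", "Canada"): 5,
    ("PC", "1Qtr", "U.S.A"): 30,
    ("PC", "2Qtr", "Canada"): 15,
}


def roll_up(cells, dims, keep):
    """Roll up by dimension reduction: sum out every dimension not in keep."""
    idx = [dims.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


def slice_cube(cells, dims, dim, member):
    """Slice: select the cells with one fixed member on a single dimension."""
    i = dims.index(dim)
    return {k: v for k, v in cells.items() if k[i] == member}


def dice(cells, dims, criteria):
    """Dice: select on several dimensions; criteria maps dimension -> allowed members."""
    return {
        k: v for k, v in cells.items()
        if all(k[dims.index(d)] in allowed for d, allowed in criteria.items())
    }


def pivot(cells, dims, new_order):
    """Pivot: reorient the cube by reordering its dimensions."""
    idx = [dims.index(d) for d in new_order]
    return {tuple(k[i] for i in idx): v for k, v in cells.items()}


by_product = roll_up(cube, DIMS, ["product"])
canada = slice_cube(cube, DIMS, "country", "Canada")
tv_q1 = dice(cube, DIMS, {"product": {"TV"}, "quarter": {"1Qtr"}})
by_country_first = pivot(cube, DIMS, ["country", "product", "quarter"])
print(by_product)
print(canada)
print(tv_q1)
```

Drill-down is the reverse direction of `roll_up`: going from `by_product` back to the finer cells in `cube`, which is only possible because the detailed cells are still stored.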
Slicing a data cube
Summary report
Example of drill-down
Drill-down with
color added
DATA WAREHOUSE BACK-END TOOLS AND UTILITIES
• Data extraction
  - get data from multiple, heterogeneous, and external sources
• Data cleaning
  - detect errors in the data and rectify them when possible
• Data transformation
  - convert data from legacy or host format to warehouse format
• Load
  - sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh
  - propagate the updates from the data sources to the warehouse
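The back-end steps listed above can be strung together in a very small pipeline. The sketch below is an illustration under stated assumptions: the CSV text stands in for an external source, the cleaning and transformation rules are deliberately simple, SQLite stands in for the warehouse, and every name (`SOURCE_CSV`, `extract`, `clean`, `transform`, `load`, the `sales` table) is hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extract with typical problems: stray whitespace,
# inconsistent capitalization, thousands separators, and a missing value.
SOURCE_CSV = """region,month,sales
North,Jan 99,"10,000"
 south ,Feb 99,"12,000"
West,Mar 99,n/a
"""


def extract(text):
    """Extraction: read the source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleaning: keep only rows whose sales value can be parsed as a number."""
    kept = []
    for row in rows:
        if row["sales"].replace(",", "").isdigit():
            kept.append(row)
        # A fuller pipeline would log or quarantine rejected rows.
    return kept


def transform(rows):
    """Transformation: convert source formats to the warehouse format."""
    return [
        {
            "region": row["region"].strip().title(),
            "month": row["month"].strip(),
            "sales": int(row["sales"].replace(",", "")),
        }
        for row in rows
    ]


def load(conn, rows):
    """Load: insert the rows and build an index on the month column."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, month TEXT, sales INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (:region, :month, :sales)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_month ON sales (month)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract(SOURCE_CSV))))
loaded = conn.execute("SELECT region, month, sales FROM sales ORDER BY region").fetchall()
print(loaded)
```

A refresh would rerun the same steps on the rows that changed in the sources since the previous load, rather than on the full extract.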