Data Warehouse Architecture

Lecture #3
IMRAN KHAN
IBA
What & Why Architecture?
“An architecture is a set of rules to adhere to when building something.”
- Because a data warehouse can become quite large and complex, using an
  architecture is essential for success.
- Strict rules for how to architect a data warehouse do not exist.
- Over the last 15 years, a few common architectures have emerged.
Imran Khan
IBA-FCS City Campus
What & Why Architecture?
According to research conducted in 2006 by The Data Warehousing Institute
(TDWI), there are five possible ways to architect a data warehouse:
1. Independent data marts: Each data mart is built and loaded individually;
   there is no common or shared metadata. This is also called a stovepipe
   solution.
2. Data mart bus: The Kimball solution with conformed dimensions.
3. Hub and spoke (corporate information factory): The Inmon solution
   with a centralized data warehouse and dependent data marts.
4. Centralized data warehouse: Similar to hub and spoke, but without the
   spokes; i.e. all end-user access is directly targeted at the data warehouse.
5. Federated: An architecture where multiple data marts or data
   warehouses already exist and are integrated afterwards. A common
   approach to this is to build a virtual data warehouse where all data still
   resides in the original source systems and is logically integrated using
   special software solutions.
The Big Debate: Inmon Versus Kimball
- In the beginning there were basically two approaches to modeling the
  data warehouse.
- Inmon popularized the term data warehouse.
  - Strong proponent of a centralized and normalized approach
- Kimball took a different perspective with his Data Marts and
  Conformed Dimensions.
Differences between the Inmon and Kimball approach
1. Data warehouse versus data marts with conformed dimensions
2. Centralized approach versus iterative/decentralized approach
3. Normalized data model versus dimensional data model
Conceptual DW Architectures
- Direct data mart
  - Short term, quick results
- Architected, enterprise data warehouse
  - Long term foundation for future development
Data Mart
- Data Mart
  - A subset of data (from the data warehouse) designed to answer
    specific business questions
  - Also called:
    - Departmental Data Warehouse (Silverston & Graziano)
    - Dimensional Data Warehouse (Kimball)
Direct Data Mart
[Figure: Source 1, Source 2, and Source 3 feed transformation routines (ETL)
that load the Sales, Financial, and Customer Service data marts directly.]
Direct Data Marts
- Pros:
  - Build individual data marts faster
  - Good for prototyping
Direct Data Marts
- Cons:
  - Requires redundant coding
    - Must transform each source multiple times
      - Once for each data mart
      - New data marts require a new transform for each source
      - New sources require multiple transformations
  - If business rules change, must change code in multiple routines
  - Increased number of routines may require more processing power
  - Multiple points of failure for ETL
    - Data marts can get out of sync
Architected Data Warehouse
- Core enterprise data warehouse design
  - Based on corporate (logical) data model
  - May include an ODS
- Specific departmental data marts
  - Based on business needs
Architected Data Warehouse
[Figure: Source 1, Source 2, and Source 3 load the Enterprise Data Warehouse,
which in turn feeds the Sales, Financial, and Customer Service data marts.]
Architected Data Warehouse
- Pros:
  - Reduced long term maintenance
    - Complex source transformations occur once
      - From source to staging area (or ODS)
      - If business rules change, code changes required in only one place
  - Reduces points of failure
  - Second set of ETL routines handles simple aggregations and data
    segmentation
    - Easier to create new data marts
  - Enterprise DW becomes source of historical data
Architected Data Warehouse
- Cons:
  - Requires more disk space
  - Requires 2 sets of ETL routines
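
The trade-off between direct data marts and an architected warehouse can be made concrete by counting ETL routines. The sketch below rests on a simplifying assumption that is not stated in the slides: a direct design needs one routine per (source, data mart) pair, while an architected design needs one routine per source into the enterprise warehouse plus one routine per data mart out of it. The function names are hypothetical.

```python
def direct_routines(num_sources: int, num_marts: int) -> int:
    """Direct data marts: each source is transformed once for each data mart."""
    return num_sources * num_marts


def architected_routines(num_sources: int, num_marts: int) -> int:
    """Architected warehouse: one load per source, then one feed per data mart."""
    return num_sources + num_marts


if __name__ == "__main__":
    # Compare the two designs as the number of sources and marts grows.
    for sources, marts in [(3, 3), (5, 4), (10, 8)]:
        print(
            f"sources={sources} marts={marts} "
            f"direct={direct_routines(sources, marts)} "
            f"architected={architected_routines(sources, marts)}"
        )
```

Under this assumption the direct design grows multiplicatively with each new source or mart, which is one way to read the "redundant coding" drawback, while the architected design pays for its two sets of routines with additive growth.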
Corporate Information Factory
[Figure: Corporate Information Factory. External, ERP, Internet, Legacy, and
Other sources connect through APIs to Data Acquisition, which feeds the
Operational Data Store and the Data Warehouse under CIF Data Management.
Data Delivery supplies the Exploration Warehouse, Data Mining Warehouse,
OLAP Data Mart, and Oper Mart, each accessed through a DSI; the Operational
Data Store connects to Operational Systems through a TrI. Surrounding
components: Information Workshop (Library & Toolbox, Workbench), Information
Feedback, Meta Data Management, and Operation & Administration (Systems
Management, Data Acquisition Management, Service Management, Change
Management).]
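Multi-Tiered Architecture
[Figure: Multi-tiered architecture with four tiers. Data Sources: operational
DBs and other sources. Data Storage: data is extracted, transformed, loaded,
and refreshed into the Data Warehouse and Data Marts, with a Monitor &
Integrator and Metadata. OLAP Engine: the OLAP Server serves the stored data.
Front-End Tools: Analysis, Query, Reports, Data mining.]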
Source Data Component
- Production Data
  - Data comes from the various operational systems of the enterprise, e.g.
    financial systems, manufacturing systems, systems along the supply chain,
    and customer relationship management systems.
- Internal Data
  - “Private” spreadsheets, documents, customer profiles, and sometimes even
    departmental databases.
- Archived Data
  - Some data is archived after a year. Sometimes data is left in the
    operational system databases for as long as five years.
- External Data
  - For example, the data warehouse of a car rental company contains data on
    the current production schedules of the leading automobile manufacturers.
    This external data in the data warehouse helps the car rental company
    plan for its fleet management.
Data Staging Component
- Data Extraction
- Data Transformation
- Data Loading
- Data staging provides a place and an area with a set of functions to
  - Clean
  - Change
  - Combine
  - Convert
  - Deduplicate
  - Prepare source data for storage and use in the data warehouse.
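
A few of the staging functions listed above (clean, combine, convert, deduplicate) can be sketched in plain Python. Everything here is illustrative: the record layout, the field names `order_id`, `amount`, and `order_date`, and the two source extracts are assumptions made for the example, not part of the lecture.

```python
from datetime import date, datetime


def clean(record):
    """Clean: strip whitespace from string fields; empty strings become None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value == "":
                value = None
        cleaned[key] = value
    return cleaned


def convert(record):
    """Convert: amount to float, order_date (YYYY-MM-DD text) to a date."""
    converted = dict(record)
    converted["amount"] = float(record["amount"])
    converted["order_date"] = datetime.strptime(
        record["order_date"], "%Y-%m-%d"
    ).date()
    return converted


def deduplicate(records, key="order_id"):
    """Deduplicate: keep the first record seen for each key value."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def stage(*sources):
    """Combine several source extracts, then clean, convert, and deduplicate."""
    combined = [record for source in sources for record in source]
    prepared = [convert(clean(record)) for record in combined]
    return deduplicate(prepared)


# Hypothetical extracts from two operational systems.
sales_system = [
    {"order_id": "A1", "amount": " 19.90 ", "order_date": "2006-03-01"},
    {"order_id": "A2", "amount": "5.00", "order_date": "2006-03-02"},
]
web_shop = [
    {"order_id": "A2", "amount": "5.00", "order_date": "2006-03-02"},
    {"order_id": "B7", "amount": "42", "order_date": "2006-03-02 "},
]

staged = stage(sales_system, web_shop)
for row in staged:
    print(row)
```

Order A2 appears in both extracts and is kept only once; the remaining rows are ready to be loaded with consistent types.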
Types of Metadata
- Operational metadata
- Extraction & Transformation metadata
- End-user metadata
Why is metadata especially important in a data warehouse?
- First, it acts as the glue that connects all parts of the data warehouse.
- Next, it provides information about the contents and structures to the
  developers.
- Finally, it opens the door to the end-users and makes the contents
  recognizable in their own terms.
OLAP Server Architectures
- Relational OLAP (ROLAP)
  - Uses a relational or extended-relational DBMS to store and manage
    warehouse data, and OLAP middleware to support missing pieces
  - Includes optimization of the DBMS backend, implementation of
    aggregation navigation logic, and additional tools and services
  - Greater scalability
- Multidimensional OLAP (MOLAP)
  - Array-based multidimensional storage engine (sparse matrix techniques)
  - Fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP)
  - User flexibility, e.g., low level: relational, high level: array
- Specialized SQL servers
  - Specialized support for SQL queries over star/snowflake schemas
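
The difference between the relational and the array-based approach can be illustrated with a small sketch. It is a toy model only: the `sales` table, its columns, and the figures are invented, SQLite stands in for the relational DBMS, and a nested list stands in for a multidimensional storage engine.

```python
import sqlite3

# Hypothetical detail rows: (region, year, amount).
rows = [
    ("east", "2005", 100.0),
    ("east", "2006", 120.0),
    ("west", "2005", 80.0),
    ("west", "2006", 95.0),
    ("east", "2006", 30.0),
]

# ROLAP style: keep detail rows in a relational table and aggregate with SQL
# at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def rolap_total(region, year):
    """Answer the query by scanning and summing relational rows."""
    cur = conn.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ? AND year = ?",
        (region, year),
    )
    return cur.fetchone()[0]


# MOLAP style: pre-compute the summary into an array addressed by the
# position of each dimension member.
regions = ["east", "west"]
years = ["2005", "2006"]
region_pos = {name: i for i, name in enumerate(regions)}
year_pos = {name: i for i, name in enumerate(years)}

cube = [[0.0 for _ in years] for _ in regions]
for region, year, amount in rows:
    cube[region_pos[region]][year_pos[year]] += amount


def molap_total(region, year):
    """Answer the query with a direct array lookup of the pre-computed cell."""
    return cube[region_pos[region]][year_pos[year]]


print(rolap_total("east", "2006"), molap_total("east", "2006"))
```

Both functions return the same total; the relational version does the work at query time, while the array version does it once when the cube is built.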
OLAP: On-Line Analytical Processing
- An environment for the analysis of multi-dimensional data
  - dice
  - rotate
  - drill-down
  - rollup
- OLAP provides advanced database support involving attribute selection,
  attribute encoding, row sampling, and data cleansing, and allows the use
  of multiple different search engines
  - Easy-to-use user interface
  - Open system architecture using local processing power
Roll-up, Drill-down, Slicing, Dicing

Roll-up (region level)
pop92    | state
         | NOR_EAS   NOR_CEN   SOUTH     WEST      Total
---------+--------------------------------------------------
LAR_CITY |   3.62%     8.59%    15.68%    13.28%    41.17%
MED_CITY |   3.35%     5.36%     5.18%     7.02%    20.91%
SMA_CITY |   2.58%     5.66%     4.85%     5.16%    18.25%
SUP_CITY |   8.30%     3.54%     2.54%     5.29%    19.67%
---------+--------------------------------------------------
Total    |  17.84%    23.15%    28.25%    30.75%   100.00%

Drill-Down
pop92    | state
         | E_N_CEN   E_SO_CE   MID_ATL   ...
---------+--------------------------------------
LAR_C    |   5.46%     2.76%     2.09%   ...
MED_C    |   3.84%     0.44%     1.38%   ...
SM_C     |   4.12%     0.92%     1.49%   ...
SUP_C    |   3.54%     0.00%     8.30%   ...
---------+--------------------------------------
Total    |  16.96%     4.12%    13.26%   ...

Dicing
pop92       | state
            | MID_ATL   NEW_ENG   NOR_EAS
------------+------------------------------
50000~60000 |  12.26%    13.69%    25.96%
60000~70000 |  10.93%     7.13%    18.05%
70000~80000 |  10.52%    14.83%    25.35%
80000~90000 |   4.89%     9.56%    14.45%
90000~99999 |   2.79%    13.40%    16.19%
------------+------------------------------
MED_CITY    |  41.39%    58.61%   100.00%

Slicing
pop92    | state
         | MID_ATL   NEW_ENG   NOR_EAS
---------+------------------------------
LAR_C    |  11.72%     8.56%    20.28%
MED_C    |   7.76%    10.99%    18.75%
SM_C     |   8.34%     6.11%    14.45%
SUP_C    |  46.52%     0.00%    46.52%
---------+------------------------------
Total    |  74.34%    25.66%   100.00%
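
The four operations in the tables above can be sketched over a small list of records. The dimension names mirror the tables (city size, region, division), but the population figures are invented for illustration, and the helper functions `aggregate`, `slice_`, and `dice` are hypothetical names, not a standard OLAP API.

```python
from collections import defaultdict

# Hypothetical facts: (city_size, region, division, population).
facts = [
    ("LAR_CITY", "NOR_EAS", "MID_ATL", 500),
    ("LAR_CITY", "NOR_EAS", "NEW_ENG", 300),
    ("MED_CITY", "NOR_EAS", "MID_ATL", 200),
    ("MED_CITY", "NOR_EAS", "NEW_ENG", 250),
    ("LAR_CITY", "SOUTH", "E_SO_CE", 400),
    ("MED_CITY", "SOUTH", "E_SO_CE", 150),
]
FIELDS = ("city_size", "region", "division", "population")
records = [dict(zip(FIELDS, fact)) for fact in facts]


def aggregate(recs, dims):
    """Sum population grouped by the given dimensions."""
    totals = defaultdict(int)
    for rec in recs:
        totals[tuple(rec[d] for d in dims)] += rec["population"]
    return dict(totals)


def slice_(recs, dim, value):
    """Slice: fix a single dimension to one value."""
    return [rec for rec in recs if rec[dim] == value]


def dice(recs, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [
        rec for rec in recs
        if all(rec[dim] in values for dim, values in allowed.items())
    ]


# Roll-up: view the data at the coarser region level.
rolled_up = aggregate(records, ["city_size", "region"])

# Drill-down: move to the finer division level.
drilled_down = aggregate(records, ["city_size", "division"])

# Slice: only the NOR_EAS region, broken down by division.
north_east = aggregate(
    slice_(records, "region", "NOR_EAS"), ["city_size", "division"]
)

# Dice: medium cities in two chosen divisions.
diced = aggregate(
    dice(records, city_size={"MED_CITY"}, division={"MID_ATL", "NEW_ENG"}),
    ["division"],
)

print(rolled_up)
print(diced)
```

Roll-up and drill-down differ only in the level of the hierarchy used for grouping, while slice and dice differ in how many dimensions are restricted before aggregating.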
Data Warehouse Design Process
- Top-down, bottom-up approaches or a combination of both
  - Top-down: Starts with overall design and planning (mature)
  - Bottom-up: Starts with experiments and prototypes (rapid)
- From software engineering point of view
  - Waterfall: structured and systematic analysis at each step before
    proceeding to the next
  - Spiral: rapid generation of increasingly functional systems, with short
    turnaround time
- Typical data warehouse design process
  - Choose a business process to model, e.g., orders, invoices, etc.
  - Choose the grain (atomic level of data) of the business process
  - Choose the dimensions that will apply to each fact table record
  - Choose the measure that will populate each fact table record
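
The four design choices (business process, grain, dimensions, measures) can be sketched as a tiny star schema. This is an illustration under assumptions: the business process is taken to be orders, the grain one row per order line, and all table names, column names, and data are hypothetical. SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions that apply to each fact table record.
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date TEXT,
    month         TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
-- Business process: orders. Grain: one row per order line.
-- Measures: quantity and amount.
CREATE TABLE fact_order_line (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);
""")

conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(1, "2006-03-01", "2006-03"), (2, "2006-04-01", "2006-04")],
)
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(10, "Widget", "Hardware"), (11, "Manual", "Books")],
)
conn.executemany(
    "INSERT INTO fact_order_line VALUES (?, ?, ?, ?)",
    [(1, 10, 2, 20.0), (1, 11, 1, 15.0), (2, 10, 5, 50.0)],
)


def amount_by_category():
    """Sum the amount measure grouped by a product dimension attribute."""
    cur = conn.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_order_line AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """)
    return cur.fetchall()


print(amount_by_category())
```

Choosing the grain first matters because it fixes what one fact row means; the dimensions and measures then follow from what can be said about a single order line.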
A practical approach
(blend of top down & bottom up)
The steps in this practical approach are as follows:
1. Plan and define requirements at the overall corporate
level
2. Create a surrounding architecture for a complete
warehouse
3. Conform and standardize the data content
4. Implement the data warehouse as a series of
supermarts, one at a time