Data Warehousing
Evolution of Database Technology
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases
Can You Easily Answer These Questions?
What is the correlation between expenditures and collection of delinquent taxes?
What is the impact on revenues and expenditures of changing the operating hours of the Dept. of Motor Vehicles?
What are Personnel Services costs across all departments for all funding sources?
What are the effects of outsourcing specific services?
What is the economic impact of the small business initiative in our district?
Overview: Data Warehousing and
OLAP Technology for Data Mining
What is a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
What is a Warehouse?
Collection of diverse data
 subject oriented
 aimed at executive, decision maker
 often a copy of operational data
 with value-added data (e.g., summaries, history)
 integrated
 time-varying
 non-volatile
What is a Warehouse?
Collection of tools
 gathering data
 cleansing, integrating, ...
 querying, reporting, analysis
 data mining
 monitoring, administering warehouse
Data Warehouse vs. Operational
DBMS
OLTP (on-line transaction processing)


Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)

Major task of data warehouse system

Data analysis and decision making
Distinct features (OLTP vs. OLAP):

User and system orientation: customer vs. market

Data contents: current, detailed vs. historical, consolidated

Database design: ER + application vs. star + subject

View: current, local vs. evolutionary, integrated

Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
                    OLTP                            OLAP
users               clerk, IT professional          knowledge worker
function            day to day operations           decision support
DB design           application-oriented            subject-oriented
data                current, up-to-date;            historical;
                    detailed, flat relational;      summarized, multidimensional;
                    isolated                        integrated, consolidated
usage               repetitive                      ad-hoc
access              read/write;                     lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction       complex query
# records accessed  tens                            millions
# users             thousands                       hundreds
DB size             100MB-GB                        100GB-TB
metric              transaction throughput          query throughput, response
Why Separate Data Warehouse?
High performance for both systems


DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
Different functions and different data:



missing data: Decision support requires historical data
which operational DBs do not typically maintain
data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
data quality: different sources typically use
inconsistent data representations, codes and formats
which have to be reconciled
Warehouse Architecture
[Figure: several Sources feed an Integration layer, which loads the Warehouse (described by Metadata); Clients reach the Warehouse through a Query & Analysis layer.]
Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse


Modify, summarize (store aggregates)
Add historical information
Modeling OLTP Systems
Goal: process as many transactions as possible in the shortest period of time
Approach
Model to 3rd Normal Form (3NF)
 Minimize redundancy to optimize update

Result
Creates many tables (often hundreds)
 Difficult for business users to understand and use
 Retrieval requires many JOINs, which slows queries

Modeling the Data Warehouse
Tuning the relational model

Denormalize
– Reduces the number of tables
– Improves usability
– Improves performance

Add aggregate data (typically separate tables)
– Improves performance
– Degrades usability
From Tables and Spreadsheets
to Data Cubes
A data warehouse is based on a multidimensional data
model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled
and viewed in multiple dimensions


Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
In data warehousing literature, an n-D base cube is
called a base cuboid. The topmost 0-D cuboid, which
holds the highest level of summarization, is called the
apex cuboid. The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
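The lattice above can be enumerated mechanically: each cuboid is one subset of the dimensions, so n dimensions without concept hierarchies give 2^n cuboids. A minimal sketch follows; the function name cuboids is an illustrative choice, not something from the slides.

```python
from itertools import combinations

def cuboids(dimensions):
    """Group every subset of the dimensions by how many dimensions it keeps."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

lattice = cuboids(["time", "item", "location", "supplier"])

# Level 0 holds the apex cuboid (no dimensions, i.e. "all");
# level 4 holds the base cuboid with all four dimensions.
total = sum(len(level) for level in lattice.values())
```

With four dimensions, total comes out to 16, matching the 1 + 4 + 6 + 4 + 1 cuboids in the lattice.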
Conceptual Modeling of Data
Warehouses
Modeling data warehouses: dimensions &
measures

Star schema: A fact table in the middle connected to a set of
dimension tables

Snowflake schema: A refinement of star schema where some
dimensional hierarchy is normalized into a set of smaller
dimension tables, forming a shape similar to snowflake

Fact constellations: Multiple fact tables share dimension tables,
viewed as a collection of stars, therefore called galaxy schema or
fact constellation
Example of Star Schema
Sales Fact Table
 keys: time_key, item_key, branch_key, location_key
 measures: units_sold, dollars_sold, avg_sales
Dimension tables
 time: time_key, day, day_of_the_week, month, quarter, year
 item: item_key, item_name, brand, type, supplier_type
 branch: branch_key, branch_name, branch_type
 location: location_key, street, city, province_or_state, country
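A cut-down version of this star schema can be tried out with Python's built-in sqlite3 module. The sketch below keeps only two of the four dimensions, renames the time table to time_dim for readability, and uses invented sample rows; all of that is assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
    quarter TEXT, year INTEGER
);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT
);
-- The fact table holds the measures plus one key per dimension table.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 5, 1, 'Q1', 2024), (2, 9, 4, 'Q2', 2024);
INSERT INTO item VALUES
    (1, 'laptop', 'Acme', 'computer'),
    (2, 'mouse', 'Acme', 'accessory');
INSERT INTO sales_fact VALUES
    (1, 1, 2, 2000.0), (1, 2, 5, 100.0), (2, 1, 1, 1000.0);
""")

def dollars_by_quarter_and_type(conn):
    """Join the fact table to its dimensions and aggregate the measure."""
    return conn.execute("""
        SELECT t.quarter, i.type, SUM(f.dollars_sold)
        FROM sales_fact AS f
        JOIN time_dim AS t ON t.time_key = f.time_key
        JOIN item AS i ON i.item_key = f.item_key
        GROUP BY t.quarter, i.type
        ORDER BY t.quarter, i.type
    """).fetchall()
```

Every analytical query has the same shape: one join per dimension it needs, then a GROUP BY on dimension attributes.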
Example of Snowflake Schema
Sales Fact Table
 keys: time_key, item_key, branch_key, location_key
 measures: units_sold, dollars_sold, avg_sales
Dimension tables
 time: time_key, day, day_of_the_week, month, quarter, year
 item: item_key, item_name, brand, type, supplier_key
 supplier: supplier_key, supplier_type
 branch: branch_key, branch_name, branch_type
 location: location_key, street, city_key
 city: city_key, city, province_or_state, country
Example of Fact Constellation
Sales Fact Table
 keys: time_key, item_key, branch_key, location_key
 measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
 keys: time_key, item_key, shipper_key, from_location, to_location
 measures: dollars_cost, units_shipped
Dimension tables
 time: time_key, day, day_of_the_week, month, quarter, year
 item: item_key, item_name, brand, type, supplier_type
 branch: branch_key, branch_name, branch_type
 location: location_key, street, city, province_or_state, country
 shipper: shipper_key, shipper_name, location_key, shipper_type
Typical OLAP Operations
Roll up (drill-up): summarize data

by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up

from higher level summary to lower level summary or detailed
data, or introducing new dimensions
Slice and dice:

project and select
Pivot (rotate):

reorient the cube, visualization, 3D to series of 2D planes.
Other operations


drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its backend relational tables (using SQL)
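The core operations above can be illustrated on a tiny in-memory fact list. This is a plain-Python sketch with invented sample data and a made-up city-to-country hierarchy; a real OLAP server would run these against a cube or a relational backend.

```python
from collections import defaultdict

# Each fact is (quarter, city, item_type, dollars_sold). Sample data is invented.
facts = [
    ("Q1", "Vancouver", "computer", 100.0),
    ("Q1", "Toronto", "computer", 80.0),
    ("Q1", "Vancouver", "phone", 40.0),
    ("Q2", "Vancouver", "computer", 120.0),
    ("Q2", "Toronto", "phone", 60.0),
]
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada"}

def roll_up_location(rows):
    """Roll up: climb the location hierarchy from city to country."""
    totals = defaultdict(float)
    for quarter, city, item_type, dollars in rows:
        totals[(quarter, CITY_TO_COUNTRY[city], item_type)] += dollars
    return dict(totals)

def slice_quarter(rows, quarter):
    """Slice: select on a single dimension (here, time)."""
    return [row for row in rows if row[0] == quarter]

def dice(rows, quarters, cities):
    """Dice: select on two or more dimensions at once."""
    return [row for row in rows if row[0] in quarters and row[1] in cities]

def pivot(rows):
    """Pivot: present the data as city rows by quarter columns.

    Item type is summed away, so this also rolls up by dimension reduction.
    """
    table = defaultdict(lambda: defaultdict(float))
    for quarter, city, _item_type, dollars in rows:
        table[city][quarter] += dollars
    return {city: dict(columns) for city, columns in table.items()}
```

Drill down is the reverse of roll_up_location: going back from the country totals to the city-level rows, which is only possible because the detailed facts are still stored.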
Relational Operators
Select
Project
Join
Multi-Tiered Architecture
[Figure: four tiers, left to right.
 Data Sources: operational DBs and other sources
 Data Storage: a Monitor & Integrator extracts, transforms, loads, and refreshes data into the Data Warehouse and Data Marts, described by Metadata
 OLAP Engine: OLAP Server, which serves the front end
 Front-End Tools: analysis, query, reports, data mining]
OLAP Server Architectures
Relational OLAP (ROLAP)




Provides a multidimensional view of a relational DB (e.g., MicroStrategy)
Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support the missing pieces
Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
Greater scalability
Multidimensional OLAP (MOLAP)


Array-based multidimensional storage engine (sparse matrix techniques)
fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)

User flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers

specialized support for SQL queries over star/snowflake schemas
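To make the array-based MOLAP idea more concrete, here is a toy sketch: each dimension member is mapped to an integer position, a cell's coordinates are turned into a single row-major offset, and only non-empty cells are stored. The class name ArrayCube and the dict-based sparse storage are assumptions for illustration; production engines use more elaborate chunking and compression.

```python
class ArrayCube:
    """Toy array-addressed cube that stores only non-empty cells."""

    def __init__(self, dimensions):
        # dimensions maps a dimension name to its ordered list of members.
        self.names = list(dimensions)
        self.index = {
            name: {member: pos for pos, member in enumerate(members)}
            for name, members in dimensions.items()
        }
        self.sizes = [len(dimensions[name]) for name in self.names]
        self.cells = {}  # offset -> measure; missing offsets are empty cells

    def offset(self, coords):
        """Row-major position of a cell, as in a dense n-dimensional array."""
        position = 0
        for name, size in zip(self.names, self.sizes):
            position = position * size + self.index[name][coords[name]]
        return position

    def add(self, coords, value):
        key = self.offset(coords)
        self.cells[key] = self.cells.get(key, 0.0) + value

    def get(self, coords):
        return self.cells.get(self.offset(coords), 0.0)

cube = ArrayCube({
    "time": ["Q1", "Q2"],
    "item": ["computer", "phone"],
    "location": ["Toronto", "Vancouver"],
})
cell = {"time": "Q2", "item": "phone", "location": "Toronto"}
cube.add(cell, 60.0)
```

Lookup is pure arithmetic on member positions, with no join or index search, which is the source of the fast access claimed for MOLAP. A ROLAP system would answer the same request with a SQL query over fact and dimension tables.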
MOLAP Databases





Data is stored using a proprietary format (MOLAP)
Accessible only through the DB vendor’s tools
Suitable only for summarized data
Data may be summarized in advance or real-time
Examples:



PowerPlay
Holos
Essbase
MOLAP versus ROLAP
MOLAP (Multidimensional OLAP)            ROLAP (Relational OLAP)
Data stored in multidimensional cube     Data stored in relational database as virtual cube
Transformation required                  No transformation needed
Data retrieved directly from cube        Data retrieved via SQL from database
for analysis                             for analysis
Faster analytical processing             Slower analytical processing
Cube size limitations                    No size limitations
Data Warehouse Back-End Tools and
Utilities
Data extraction:
 get data from multiple, heterogeneous, and external
sources
Data cleaning:
 detect errors in the data and rectify them when
possible
Data transformation:
 convert data from legacy or host format to warehouse
format
Load:
 sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
Refresh
 propagate the updates from the data sources to the
warehouse
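The back-end steps above can be chained as a small pipeline. The sketch below is illustrative only: the in-memory sources, the field names (region, amount), and the region-code mapping are invented, and the refresh step is left out.

```python
def extract(sources):
    """Extraction: gather rows from several (here, in-memory) sources."""
    for source in sources:
        yield from source

def clean(rows):
    """Cleaning: drop rows whose amount is missing or not numeric."""
    for row in rows:
        try:
            amount = float(row.get("amount"))
        except (TypeError, ValueError):
            continue
        yield {**row, "amount": amount}

# Hypothetical mapping that reconciles inconsistent source codes.
REGION_CODES = {"BC": "British Columbia", "B.C.": "British Columbia", "ON": "Ontario"}

def transform(rows):
    """Transformation: convert source-specific codes to the warehouse format."""
    for row in rows:
        region = REGION_CODES.get(row["region"], row["region"])
        yield {"region": region, "amount": row["amount"]}

def load(rows):
    """Load: summarize per region and return the totals in sorted order."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return dict(sorted(totals.items()))

source_a = [{"region": "BC", "amount": "10.5"}, {"region": "ON", "amount": "n/a"}]
source_b = [{"region": "B.C.", "amount": 4.5}, {"region": "Ontario", "amount": 7}]
warehouse = load(transform(clean(extract([source_a, source_b]))))
```

Here the unparseable "n/a" row is dropped during cleaning, and the two spellings of British Columbia are merged before loading.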
Data Mining: A KDD Process
Data mining: the core of knowledge discovery process.
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Steps of a KDD Process
Learning the application domain:
 relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
 Find useful features, dimensionality/variable reduction,
invariant representation.
Choosing functions of data mining

summarization, classification, regression, association,
clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
 visualization, transformation, removing redundant patterns,
etc.
Use of discovered knowledge
Data Mining and Business Intelligence
[Figure: pyramid of layers, with increasing potential to support business decisions toward the top. From top to bottom:
 Making Decisions
 Data Presentation: Visualization Techniques
 Data Mining: Information Discovery
 Data Exploration: Statistical Analysis, Querying and Reporting
 Data Warehouses / Data Marts: OLAP, MDA
 Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside, from top to bottom: End User, Business Analyst, Data Analyst, DBA]
Summary
Data warehouse

A subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making
process
A multi-dimensional model of a data warehouse

Star schema, snowflake schema, fact constellations

A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and
pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
From OLAP to OLAM