Data Warehouses
for Decision Support
Dr. Denise Ecklund
16 October 2002
©2002 Denise Ecklund
What and Why of Data Warehousing
[Figure: several database systems and a datastore feed Data Extraction and Load, which populates the Data Warehouse System; DSS app workstations issue queries against its datacubes]
• What: A very large database containing materialized views of
multiple, independent source databases.
The views generally contain aggregation data (aka datacubes).
• Why: The data warehouse (DW) supports read-only queries for
new applications, e.g., DSS, OLAP & data mining.
Data Warehouse (DW) Life Cycle
• The Life Cycle:
– Global Schema Definition
– Data Extraction and Load
– Query Processing
– Data Update
• General Problems:
– Heavy user demand
– Problems with source data
• ownership, format, heterogeneity
– Underestimating complexity
& resources for all phases
• Boeing Computing Services – DW for DSS in airplane repair
– DW size: 2-3 terabytes
– Online query services: 24×7 service
– Data life cycle: retain data for 70+ years (until the airplane is retired)
– Data update: No “nighttime”; concurrent refresh is required
– Access paths: Support new and old methods for 70+ years
Global Schema Design – Base Tables
• Fact Table
– Stores basic facts from the source databases (often denormalized)
– Data about past events (e.g., sales, deliveries, factory outputs, ...)
– Each fact has a time (or time period) associated with it
– Data is very unlikely to change at a data source; no updates
– Very large tables (up to 1 TB)
• Dimension Table
– Attributes of one dimension of a fact table (typically denormalized)
– A chain of dimension tables can describe attributes of other
dimension tables (normalized or denormalized)
– Data can change at a data source; updates executed occasionally
[Figure: example schema. Fact Table columns: ProductID, SupplierID, PurchaseDate, DeliveryDate, CustYrs. Dimension Tables: (ProductID, ProdName, ProdDesc, ProdStyle, ManufSite); (SupplierID, SuppName, SuppAddr, SuppPhone, Date1stOrder); (TimeID, Quarter, Year); further attributes shown: AuditID, AuditName, AuditComp, Addr, AcctName, Phone, ContractYr]
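The split between a fact table and a dimension table can be sketched with an in-memory SQLite database. This is only an illustration: the column names ProductID, ProdName, SupplierID and PurchaseDate come from the example tables above, while the table names, the Amount measure and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive attributes, updated occasionally.
    CREATE TABLE Product (ProductID TEXT PRIMARY KEY, ProdName TEXT);
    -- Fact table: one row per past event, not updated after load.
    CREATE TABLE Purchase (
        ProductID TEXT REFERENCES Product(ProductID),
        SupplierID TEXT,
        PurchaseDate TEXT,
        Amount INTEGER  -- hypothetical measure, not shown on the slide
    );
    INSERT INTO Product VALUES ('P11', 'ATMCard'), ('P14', 'MPEGCard');
    INSERT INTO Purchase VALUES
        ('P11', 'S1', '2001-01-15', 100),
        ('P11', 'S2', '2001-02-20', 250),
        ('P14', 'S1', '2001-03-05', 400);
""")

# Star join: aggregate the facts, label them with a dimension attribute.
rows = conn.execute("""
    SELECT d.ProdName, SUM(f.Amount)
    FROM Purchase AS f JOIN Product AS d ON f.ProductID = d.ProductID
    GROUP BY d.ProdName
    ORDER BY d.ProdName
""").fetchall()
print(rows)
```

A summary table would store the result of such a join and aggregation so that it does not have to be recomputed for each query.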
Schema Design Patterns
[Figure: four schema diagrams built from fact tables F, F1, F2 and dimension tables D1, D2, D3, which appear as D1.1, D1.2, D2.1, D2.2, D3.1, D3.2 when split]
• Star Schema
– D1, D2, D3 are denormalized
• Snowflake Schema
– D1, D2, D3 are normalized
• Starflake Schema
– D3 may be normalized or denormalized
• Constellation Schema
– D1 stores attributes about a relationship between F1 and F2
Summary Tables
• aka datacubes or multidimensional tables
• Store precomputed query results for likely queries
– Reduce on-the-fly join operations
– Reduce on-the-fly aggregation functions, e.g., sum, avg
• Store denormalized data
• Aggregate data from one or more fact tables
and/or one or more dimension tables
• Discard and compute new summaries
as the set of likely queries changes
[Figure: a Summary Table derived from Fact Tables and Dimension Tables (Dim Table#1, Dim Table#2)]
Summary Tables = Datacubes
[Figure: datacube "Total Expenses for Parts by Product, Supplier and Quarter" with axes Supplier (S1, S2, All-S), Fiscal Quarter (Q1, Q2, Q3, Q4, All-Q) and Product (P11, P14, P19, P27, P33, All-P). Values shown along the quarter axis: Q1 0.2M £, Q2 0.4M £, Q3 0.6M £, Q4 1.0M £, All-Q 2.2M £. Other cell values shown: 1.2M £, 1.0M £, 2.7M £, 4.6M £]
• Example cells called out in the figure:
– Average Price paid to all suppliers of parts for Product P11 in the 1st quarter: GROUP BY product, quarter
– Total Expenses paid to all suppliers of parts for Product P19 in the 1st quarter: GROUP BY product, quarter
– Total Expenses paid to Supplier S1 for parts for Product P33 in 2nd quarter: GROUP BY supplier, product, quarter
– Total Expenses paid to Supplier S2 for parts for all products in all quarters: GROUP BY supplier
• Typical, pre-computed Measures are:
– Sum, percentage, average, std deviation, count, min-value, max-value, percentile
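How a datacube pre-computes one measure for every GROUP BY combination can be sketched as follows. The fact rows are hypothetical, and the `ALL` coordinate is an illustrative stand-in for the All-S, All-P and All-Q entries of the figure.

```python
from collections import defaultdict
from itertools import product

# Hypothetical base facts: (supplier, product, quarter, expense in thousands of pounds).
facts = [
    ("S1", "P11", "Q1", 120),
    ("S2", "P11", "Q1", 80),
    ("S1", "P33", "Q2", 300),
    ("S2", "P19", "Q1", 50),
]

def build_cube(rows):
    """Pre-compute SUM(expense) for every GROUP BY combination of the three dimensions.

    'ALL' in a coordinate means that dimension has been rolled up.
    """
    cube = defaultdict(int)
    for supplier, prod, quarter, expense in rows:
        # Each fact contributes to 2 * 2 * 2 = 8 cells of the cube.
        for s, p, q in product((supplier, "ALL"), (prod, "ALL"), (quarter, "ALL")):
            cube[(s, p, q)] += expense
    return dict(cube)

cube = build_cube(facts)
print(cube[("ALL", "P11", "Q1")])  # all suppliers, product P11, 1st quarter
print(cube[("S2", "ALL", "ALL")])  # supplier S2, all products, all quarters
```

Answering either query is then a single lookup instead of a scan and aggregation over the fact rows.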
Too Many Summary Tables
• The Schema Design Problem:
– Given a finite amount of disk storage, what views
(summaries) will you pre-compute in the data warehouse?
• Factors to be considered:
– What queries must DW support?
– What source data are available?
– What is the time-space trade-off to store
versus re-compute joins and measures?
– Cost to acquire and update the data?
• An NP-complete optimization problem
– Use heuristics and approximation algorithms
  • Benefit Per Unit Space (BPUS)
  • Pick By Size (PBS)
  • Pick By Size–Use (PBS-U)
• Use a derivation lattice to analyze
possible materialized views
[Figure: Derivation Lattice of materialized views with nodes A, B, C, D, E, F, G, H and ALL/None]
A Lattice of Summary Tables
• Derivation Lattice
– Nodes: The set of attributes that would appear in
the "group by" clause to construct this view
– Edges: Connect view V2 to view V1 if V1
can be used to answer queries over V2
– MetaData: estimated # of records in each view
• Determine cost and benefit of each view
• Select a subset of the possible views
• Typical simplifying assumptions:
– Query cost ≈ # of records scanned to answer the query
– I/O costs are much larger than CPU cost to compute measures
– Ignore cost reductions due to using indexes to access records
– All queries are equally likely to occur
[Figure: Derivation Lattice for parts, supplier, & customers with nodes PSC, PC, PS, SC, P, S, C and ALL/None; size labels 6M and 0.1M]
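The lattice for parts, suppliers and customers can be sketched in a few lines. This is a minimal illustration under the simplifying assumptions listed above (query cost = records scanned). The view sizes are borrowed from the Pick By Size table in this deck, and the one-row size for ALL/None is an assumption.

```python
from itertools import combinations

DIMS = ("P", "S", "C")  # parts, suppliers, customers

def lattice_nodes(dims):
    """Every subset of the dimensions is one possible GROUP BY view."""
    return [frozenset(c) for r in range(len(dims), -1, -1)
            for c in combinations(dims, r)]

def can_answer(v1, v2):
    """V1 can answer queries over V2 when V2 groups by a subset of V1's attributes."""
    return v2 <= v1

def query_cost(view, materialized, size):
    """Records scanned: the smallest materialized view that can answer the query."""
    return min(size[m] for m in materialized if can_answer(m, view))

# Estimated record counts per view; the size of ALL/None (one row) is an assumption.
SIZE = {frozenset("PSC"): 6_000_000, frozenset("PC"): 6_000_000,
        frozenset("PS"): 800_000, frozenset("SC"): 6_000_000,
        frozenset("P"): 200_000, frozenset("S"): 10_000,
        frozenset("C"): 100_000, frozenset(): 1}

nodes = lattice_nodes(DIMS)
# Edge (V2, V1): V1 has exactly one more grouping attribute than V2.
edges = [(v2, v1) for v1 in nodes for v2 in nodes
         if v2 < v1 and len(v1) - len(v2) == 1]
top = frozenset(DIMS)

print(len(nodes), len(edges))
print(query_cost(frozenset("P"), [top], SIZE))
print(query_cost(frozenset("P"), [top, frozenset("PS")], SIZE))
```

Materializing PS lowers the cost of a query grouped by P from the size of PSC to the size of PS, which is the kind of benefit the selection algorithms try to maximize.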
Benefit Per Unit Space (BPUS)
[Figure: Derivation Lattice of materialized views with nodes A, B, C, D, E, F, G, H and ALL/None]

View sizes (in millions of records):
View   #MRecs      View   #MRecs
A      100         E      30
B      50          F      40
C      75          G      1
D      20          H      10

• S is the set of views we will materialize
• bf(u,v,S) = min(#w-#v | w∈S and w covers u), or 0 if that difference is negative
• Benefit(v, S) = SUM (bf(u,v,S) | u=v or v covers u)

Selection sequence: S = {A}; S = S U {B}; S = S U {F}; S = S U {D}

View   Benefit- round#1   Benefit- round#2   Benefit- round#3
B      50 * 5 = 250       (selected)         (selected)
C      25 * 5 = 125       25 * 2 = 50        25 * 1 = 25
D      80 * 2 = 160       30 * 2 = 60        30 * 2 = 60
E      70 * 3 = 210       20 * 3 = 60        20 * 2 + 10 = 50
F      60 * 2 = 120       60 + 10 = 70       (selected)
G      99 * 1 = 99        49 * 1 = 49        49 * 1 = 49
H      90 * 1 = 90        40 * 1 = 40        30 * 1 = 30

• Savings: read 420M records, not 800M
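The three selection rounds can be reproduced with a short greedy loop. This is a sketch, not the author's code. The view sizes are those on the slide, but the lattice edges are an assumption, reconstructed so that the number of views each candidate covers matches the multipliers in the benefit table (B covers 5 views, D covers 2, and so on). As in the rounds shown, the benefit is not divided by the view's size.

```python
# View sizes in millions of records, from the table on this slide.
SIZE = {"A": 100, "B": 50, "C": 75, "D": 20, "E": 30, "F": 40, "G": 1, "H": 10}

# ASSUMPTION: direct-child edges of the lattice figure (not recoverable from the text).
CHILDREN = {"A": ["B", "C"], "B": ["D", "E"], "C": ["E", "F"], "D": ["G"],
            "E": ["G", "H"], "F": ["H"], "G": [], "H": []}

def covered(v):
    """Return v plus every view that can be derived from v."""
    seen, stack = set(), [v]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(CHILDREN[node])
    return seen

def current_cost(u, S):
    """Records read to answer a query on u: the smallest view in S that covers u."""
    return min(SIZE[w] for w in S if u in covered(w))

def benefit(v, S):
    """Records saved over all views covered by v if v were added to S."""
    return sum(max(0, current_cost(u, S) - SIZE[v]) for u in covered(v))

def greedy_select(k):
    """Start from the top view A and add the highest-benefit view k times."""
    S = ["A"]
    for _ in range(k):
        candidates = [v for v in SIZE if v not in S]
        S.append(max(candidates, key=lambda v: benefit(v, S)))
    return S

selected = greedy_select(3)
total = sum(current_cost(u, selected) for u in SIZE)
print(selected, total)
```

With only A materialized, each of the eight queries reads 100M records (800M in total); after the three picks the total drops to the 420M quoted above.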
Pick By Size (PBS)
Table sizes (in millions of records):
Parts+Suppls+Custs (6M)    Parts (0.2M)
Parts+Custs (6M)           Suppls (0.01M)
Parts+Suppls (0.8M)        Custs (0.1M)
Suppls+Custs (6M)

[Figure: Derivation Lattice with nodes PSC, PC, PS, SC, P, S, C and ALL/None]

S = {PSC}
While (space > 0) Do
    v = smallest view in views
    If (space - |v| > 0) Then
        space = space - |v|
        S = S U {v}
        views = views – {v}
    Else space = 0

• Storage Savings:
Reduced from 19.11M records (all seven views)
to 7.11M records (PSC, PS, P, S, C)
• Query Savings:
Read 19.11M records, not 42M
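A runnable sketch of the Pick By Size loop, using the view sizes from this slide. The space budget of 2M records is an assumption (the slide does not state one), and the helper `query_cost` is an illustrative name.

```python
# View sizes in records, from the table on this slide.
SIZE = {"PSC": 6_000_000, "PC": 6_000_000, "PS": 800_000, "SC": 6_000_000,
        "P": 200_000, "S": 10_000, "C": 100_000}

def pick_by_size(size, space):
    """Add views smallest-first until the next one no longer fits."""
    selected = ["PSC"]  # the top view is always materialized
    views = sorted((v for v in size if v != "PSC"), key=lambda v: size[v])
    for v in views:
        if space - size[v] > 0:
            space -= size[v]
            selected.append(v)
        else:
            break  # corresponds to "Else space = 0" in the pseudocode
    return selected

def query_cost(view, selected, size):
    """Smallest selected view whose attributes include those of the queried view."""
    return min(size[w] for w in selected if set(view) <= set(w))

# ASSUMPTION: a budget of 2M records for views other than PSC.
selected = pick_by_size(SIZE, space=2_000_000)
storage = sum(SIZE[v] for v in selected)
reads = sum(query_cost(v, selected, SIZE) for v in SIZE)
print(selected, storage, reads)
```

Under that budget the loop picks S, C, P and PS, and the two 6M views PC and SC are left out. One query per view then reads 19.11M records in total, against 7 × 6M = 42M when only PSC is stored.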
Pick By Size-Use (PBS-U)
• Extends the Pick By Size algorithm to consider
the frequency of queries on each possible view

Table sizes (#MRecs) & query frequency (probabilities):
Parts+Suppls+Custs (6M, 0.05)    Parts (0.2M, 0.1)
Parts+Custs (6M, 0.3)            Suppls (0.01M, 0.1)
Parts+Suppls (0.8M, 0.3)         Custs (0.1M, 0.1)
Suppls+Custs (6M, 0.05)

[Figure: Derivation Lattice with nodes PSC, PC, PS, SC, P, S, C and ALL/None]

While (space > 0) Do
    v = the view in views with the smallest |v| / prob(v)
    If (space - |v| > 0) Then
        space = space - |v|
        S = S U {v}
        views = views – {v}
    Else space = 0

• This query frequency did not change the selected views → same savings
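The same loop with the size-to-frequency ratio as the ordering key can be sketched as follows. Sizes and probabilities are those on this slide; the 2M-record space budget is an assumption, as the slide gives none.

```python
# (size in records, query probability) per view, from the table on this slide.
VIEWS = {"PSC": (6_000_000, 0.05), "PC": (6_000_000, 0.3), "PS": (800_000, 0.3),
         "SC": (6_000_000, 0.05), "P": (200_000, 0.1), "S": (10_000, 0.1),
         "C": (100_000, 0.1)}

def pick_by_size_use(views, space):
    """Pick views in increasing order of |v| / prob(v) while they still fit."""
    selected = ["PSC"]  # the top view is always materialized
    remaining = {v for v in views if v != "PSC"}
    while space > 0 and remaining:
        v = min(remaining, key=lambda name: views[name][0] / views[name][1])
        if space - views[v][0] > 0:
            space -= views[v][0]
            selected.append(v)
            remaining.remove(v)
        else:
            space = 0
    return selected

# ASSUMPTION: a budget of 2M records for views other than PSC.
selected = pick_by_size_use(VIEWS, space=2_000_000)
print(selected)
```

The ratios order the candidates as S, C, P, PS, PC, SC. For the four views that fit, this is the same order as ordering by size alone, which is why the selection does not change for these frequencies.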
Comparing Schema Design Algorithms
• All Proposed Algorithms
– Produce only a near optimal solution
  • Best known is within (0.63 – f) of optimal,
  where f is the fraction of space consumed by the largest table
– Make (unrealistic) assumptions
  • e.g., ignore indexed data access
– Rely heavily on having good metadata
  • e.g., table size and query frequency
• Algorithmic Performance
– Benefit Per Unit Space (BPUS): O(n^3) runtime
– Pick By Size (PBS): O(n log n) runtime
• Limited applicability for PBS?
– Finds the near optimal solution only for SR-Hypercube lattices
A lattice forms an SR-Hypercube when, for each v in the lattice except v = DBT,
|v| ≥ ((# of direct children of v) * (# of records in the child of v))
Data Extraction and Load
Step 1: Extract and clean data from all sources
– Select source, remove data inconsistencies, add default values
Step 2: Materialize the views and measures
– Reformat data, recalculate data, merge data from multiple sources, add
time elements to the data, compute measures
Step 3: Store data in the DW
– Create metadata and access path data, such as indexes
• Major Issue: Failure during extraction and load
• Approaches:
– UNDO/REDO logging
  • Too expensive in time and space
– Incremental Checkpointing
  • When to checkpoint? Modularize and divide the long-running tasks
  • Must use UNDO/REDO logs also; need high-performance logging
Materializing Summary Tables
• Scenario: CompsiQ has factories in 7 cities. Each factory
manufactures several of CompsiQ’s 30 hardware products.
Each factory has 3 types of manufacturing lines:
robotic, hand-assembly, and mixed-line.
• Schema for source data from Factory-A:
YieldInfo(ProductCode, RoboticYield, Hand-AssemYield, MixedLineYield, Week, Year)
ProductInfo(ProductCode, ProductName, ProductType, FCS-Date, EstProductLife)
• Target summary query:
What is last year’s yield from Factory-A by product type?
Materialization using SchemaSQL
What is last year’s yield from Factory-A by product type?

select p.ProductType, sum(y.lt)
from Factory-A::YieldInfo→ lt,
     Factory-A::YieldInfo y,
     Factory-A::ProductInfo p
where lt <> "ProductCode"
  and lt <> "Week"
  and lt <> "Year"
  and y.ProductCode = p.ProductCode
  and y.Year = 01
group by p.ProductType

At execution time, lt ranges over the attribute names in relation YieldInfo.

YieldInfo(ProductCode, RoboticYield, Hand-AssemYield, MixedLineYield, Week, Year)
ProductInfo(ProductCode, ProductName, ProductType, FCS-Date, EstProductLife)
Aggregation Over Irregular Blocks
YieldInfo
ProductCode  RoboticYield  Hand-AssemYield  MixedLineYield  Week  Year
P11          17            12               5               45    01
P12          9             11               12              45    01
P13          5             10               3               45    01
P14          22            8                7               45    01
...
P11          20            15               0               46    01
P12          8             9                10              46    01
P13          31            0                0               46    01
P14          15            15               20              46    01
...

ProductInfo
ProductCode  ProductName  ProductType  FCS-Date  EstProductLife
P11          ATMCard      Net          3-8-99    36
P12          SMILCard     Video        1-02-98   18
P13          ATMHub       Net          1-11-99   36
P14          MPEGCard     Video        24-3-00   24
P15          MP3          Audio        17-1-01   36
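The effect of the SchemaSQL query can be imitated in plain Python by iterating over column names as well as rows. This is a sketch of the idea only, not SchemaSQL itself. The rows are the ones readable in the tables above (the elided "..." rows are not included), and the function name is illustrative.

```python
from collections import defaultdict

YIELD_COLUMNS = ("ProductCode", "RoboticYield", "Hand-AssemYield",
                 "MixedLineYield", "Week", "Year")
yield_info = [
    ("P11", 17, 12, 5, 45, "01"), ("P12", 9, 11, 12, 45, "01"),
    ("P13", 5, 10, 3, 45, "01"), ("P14", 22, 8, 7, 45, "01"),
    ("P11", 20, 15, 0, 46, "01"), ("P12", 8, 9, 10, 46, "01"),
    ("P13", 31, 0, 0, 46, "01"), ("P14", 15, 15, 20, 46, "01"),
]
product_type = {"P11": "Net", "P12": "Video", "P13": "Net",
                "P14": "Video", "P15": "Audio"}

def yield_by_product_type(rows, year):
    """Sum every yield column, per product type, for one year."""
    # Plays the role of lt: range over attribute names, skipping the non-yield ones.
    yield_cols = [i for i, name in enumerate(YIELD_COLUMNS)
                  if name not in ("ProductCode", "Week", "Year")]
    totals = defaultdict(int)
    for row in rows:
        if row[5] == year:
            # Join with ProductInfo on ProductCode, then group by ProductType.
            totals[product_type[row[0]]] += sum(row[i] for i in yield_cols)
    return dict(totals)

result = yield_by_product_type(yield_info, "01")
print(result)
```

Each total combines three columns across several non-adjacent rows of YieldInfo. Product P15 (Audio) has no yield rows in the visible data, so it does not appear in the result.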
User Queries
• Retrieve pre-computed data or formulate
new measures not materialized in the DW.
• New user operations on logical datacubes:
– Roll-up, Drill-down, Pivot/Rotate
– Slicing and Dicing with a “data blade”
– Sorting
– Selection
– Derived Attributes
[Figure: datacube with axes Supplier (S1, S2, All-S), Fiscal Quarter (Q1, Q2, Q3, Q4, All-Q) and Product (P11, P14, P19, P27, P33)]
Query Processing
• Traditional query transformations
• Index intersection and union
• Advanced join algorithms
• Piggy-backed scans
– Multiple queries with different selection criteria
• SQL extensions => new operators
– Red Brick Systems has proposed 8 extensions, including:
  • MovingSum and MovingAvg
  • Rank … When
  • RatioToReport
  • Tertiles
  • Create Macro
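What operators such as MovingSum, MovingAvg and RatioToReport compute can be sketched in plain Python. The semantics below are an assumption based on the operator names (for example, that the first rows use a shorter window); the slides do not define the exact Red Brick behaviour, and the sample values are illustrative.

```python
def moving_sum(values, window):
    """Sum of the current value and up to window-1 preceding values."""
    return [sum(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]

def moving_avg(values, window):
    """Average over the same sliding window as moving_sum."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def ratio_to_report(values):
    """Each value as a fraction of the total over all reported rows."""
    total = sum(values)
    return [v / total for v in values]

# Sample quarterly expenses in thousands of pounds (Q1..Q4).
quarterly = [200, 400, 600, 1000]
sums = moving_sum(quarterly, 2)
avgs = moving_avg(quarterly, 2)
ratios = ratio_to_report(quarterly)
print(sums, avgs, ratios)
```

Expressing these in standard SQL of the time typically required self-joins or correlated subqueries, which is the motivation for adding them as operators.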
Data Update
• Data sources change over time
• Must “refresh” the DW
– Adds new historical data to the fact tables
– Updates descriptive attributes in the dimension tables
– Forces recalculation of measures in summary tables
• Issues:
1. Monitoring/tracking changes at the data sources
2. Recalculation of aggregated measures
3. Refresh typically forces a shutdown for DW query processing
Monitoring Data Sources
Approaches:
1. Value-deltas – Capture before and after values of all
tuples changed by normal DB operations and store them
in differential relations.
  • Issues: must take the DW offline to install the modified values
2. Operation-deltas – Capture SQL updates from the
transaction log of each data source and build a new log
of all transactions that affect data in the DW.
  • Advantages: DW can remain online for query processing while
  executing data updates (using traditional concurrency control)
3. Hybrid – use value-deltas and operation-deltas for
different data sources or a subset of the relations from a
data source.
Creating a Differential Relation
Approaches at the Data Source:
1. Execute the update query 3 times
  • (1) Select and record the before values;
  (2) Execute the update;
  (3) Select and record the after values
  • Issues: High cost in time & space;
  reduces autonomy of the data sources
2. Define and insert DB triggers
  • Triggers fire on “insert”, “delete”, and “update”
  operations; Log the before and after values
  • Issues: Not all data sources support triggers;
  reduces autonomy of the data sources
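The trigger approach can be sketched with SQLite, which supports row-level triggers with OLD and NEW values. The source table reuses two supplier attribute names from the schema example earlier in the deck; the delta table, trigger names and sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Supplier (SupplierID TEXT PRIMARY KEY, SuppPhone TEXT);
    -- Differential relation: one row per changed tuple, before and after values.
    CREATE TABLE Supplier_delta (
        op TEXT, SupplierID TEXT, before_phone TEXT, after_phone TEXT
    );
    CREATE TRIGGER supplier_ins AFTER INSERT ON Supplier BEGIN
        INSERT INTO Supplier_delta
        VALUES ('insert', NEW.SupplierID, NULL, NEW.SuppPhone);
    END;
    CREATE TRIGGER supplier_upd AFTER UPDATE ON Supplier BEGIN
        INSERT INTO Supplier_delta
        VALUES ('update', NEW.SupplierID, OLD.SuppPhone, NEW.SuppPhone);
    END;
    CREATE TRIGGER supplier_del AFTER DELETE ON Supplier BEGIN
        INSERT INTO Supplier_delta
        VALUES ('delete', OLD.SupplierID, OLD.SuppPhone, NULL);
    END;
""")

# Normal DB operations at the data source; the triggers log each change.
conn.execute("INSERT INTO Supplier VALUES ('S1', '555-0100')")
conn.execute("UPDATE Supplier SET SuppPhone = '555-0199' WHERE SupplierID = 'S1'")
conn.execute("DELETE FROM Supplier WHERE SupplierID = 'S1'")

deltas = conn.execute("SELECT * FROM Supplier_delta ORDER BY rowid").fetchall()
for row in deltas:
    print(row)
```

The sketch also shows the autonomy cost noted above: the triggers and the delta table have to be installed inside the source database itself.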
Creating Operation-Deltas
• The process:
– Scan the transaction log at each data source
– Select pertinent transactions and delta-log them
• Advantage:
– Op-delta is much smaller than the value-delta
• Issues:
– Must transform the update operation on the data
source schema into an update operation on the DW
schema, which is not always possible.
Hence it cannot be used in all cases.
Recalculating Aggregated Measures
• Delta Tables
– Assume we have differential relations for the
base facts in the data sources (i.e., value deltas)
– Two processing phases (Propagation & Refresh):
1) Propagation – pre-compute all new tuples and all
replacement tuples and store them in a delta table
[Figure: Differential Relations and the Global DW Schema feed the Propagation Process, which produces Delta Tables]
Recalculating Aggregated Measures
2) Refresh – Scan the DW tuples, replace existing
tuples with the pre-computed tuple values, insert
new tuples from the delta tables
[Figure: Delta Tables and DW Tables feed the Refresh Process, which produces Updated DW Tables]
Issue:
Cannot pre-compute Delta Table for non-commutative measures
Ex: average (without #records), percentiles
Must compute these during the refresh phase.
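The two phases can be sketched for a sum-based measure. This is a minimal illustration that handles inserted facts only; the summary layout (sum and #records per group), the function names and the sample values are assumptions.

```python
def propagate(differential_rows):
    """Propagation: fold value deltas into a per-group delta table of (sum, count)."""
    delta = {}
    for group, amount in differential_rows:
        s, n = delta.get(group, (0, 0))
        delta[group] = (s + amount, n + 1)
    return delta

def refresh(summary, delta):
    """Refresh: replace existing summary tuples and insert new ones."""
    for group, (ds, dn) in delta.items():
        s, n = summary.get(group, (0, 0))
        summary[group] = (s + ds, n + dn)
    return summary

def average(summary, group):
    """Average derived from the stored sum and #records."""
    s, n = summary[group]
    return s / n

summary = {"P11": (300, 3)}               # hypothetical DW summary: (sum, #records)
new_facts = [("P11", 100), ("P19", 50)]   # hypothetical newly inserted base facts
delta = propagate(new_facts)              # can run while the DW stays online
refresh(summary, delta)
print(summary, average(summary, "P11"))
```

Because the summary keeps #records alongside the sum, the new average follows from the merged tuple. A stored average without #records could not be combined with the delta this way, and percentiles need the underlying values.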
Data Marting
• What: Stores a second copy of a subset of a DW
[Figure: the Data Warehouse System feeds Data Extraction and Load into one or more Data Mart Systems; each data mart serves queries and datacubes to DSS app workstations]
• Why build a data mart?
– A user group with special needs (dept.)
– Better performance accessing fewer records
– To support a “different” user access tool
– To enforce access control over different subsets
– To segment data over different hardware platforms
Costs and Benefits of Data Marting
• System costs:
– More hardware (servers and networks)
– Define a subset of the global data model
– More software to:
• Extract data from the warehouse
• Load data into the mart
• Update the mart (after the warehouse is updated)
• User benefits:
– Define new measures not stored in the DW
– Better performance (mart users and DW users)
– Support a more appropriate user interface
• Ex: a browser with forms versus SQL queries
– Company achieves more reliable access control
Commercial DW Products
• Short list of companies with DW products:
– Informix/Red Brick Systems
– Oracle
– Prism Solutions
– Software AG
• Typical Products and Tools
– Specially tuned DB Server
– DW Developer Tools: data extraction, incremental update,
index builder
– User Tools: ad hoc query and spreadsheet tools for DSS
and post-processing (creating graphs, pie-charts, etc.)
– Application Developer Tools (toolkits for OLAP and DSS):
spreadsheet components, statistics packages, trend analysis
and forecasting components
Ongoing Research Problems
• How to incorporate domain and business rules
into DW creation and maintenance
• Replacing manual tasks with intelligent agents
– Data acquisition, data cleaning, schema design,
DW access paths analysis and index construction
• Separate (but related) research areas:
– Tools for data mining and OLAP
– Providing active database services in the DW