Data Warehousing/Mining

Data Warehousing/Mining
Comp 150
Data Warehousing Design
(not in book)
Instructor: Dan Hebert
Warehouse Design

What to materialize in the warehouse
– Which source data?
– Which summary tables?
– Which indices?
Influenced by both querying and maintenance
Trade storage space and update time for query speed

Designing a Data Warehouse
Data models designed to support DW require optimization strategies for DSS
Design options
– Relational model in DW: ROLAP servers for analysis
– Special-purpose multi-dimensional data model in DW (MDDB): MOLAP servers for analysis
Why is DW Design Different?


DSS: few transactions, each accessing a large number of records
Typical ER designs tend to be complex and difficult to navigate
Topic/Function | Operational | Decision Support
Data Content | Current values | Archival data
Data Organization | Application by application | Subject areas across enterprises
Nature of Data | Dynamic | Static until refreshed
Data Structure, Format | Complex: suitable for operational computation | Simple: suitable for business analysis
Random Access Probability | High | Moderate to low
Data Update | Updated on a field-by-field basis | Accessed and read: no direct update
Usage | Highly structured repetitive processing | Highly unstructured analytical processing
Response Time | Sub-second to 2-3 seconds | Seconds to minutes, hours
Multi-Dimensional Data
Measures - numerical data being tracked
Dimensions - business parameters that define a transaction
Example: Analyst may want to view sales data (measure) by geography, by time, and by product (dimensions)
Dimensional modeling is a technique for structuring data around the business concepts
ER models describe “entities” and “relationships”
Dimensional models describe “measures” and “dimensions”
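
As a rough illustration of measures and dimensions, the sketch below totals one measure along a chosen dimension. The records, the field names, and the helper `total_by` are made up for this example and are not part of the course material.

```python
from collections import defaultdict

# Hypothetical sales records: three dimensions and one measure ("units").
sales = [
    {"geography": "LA", "time": "Q1", "product": "Milk", "units": 10},
    {"geography": "LA", "time": "Q2", "product": "Milk", "units": 5},
    {"geography": "NY", "time": "Q1", "product": "Milk", "units": 7},
    {"geography": "NY", "time": "Q1", "product": "Bread", "units": 3},
]

def total_by(records, dimension, measure="units"):
    """Sum the measure for each distinct value of the chosen dimension."""
    totals = defaultdict(int)
    for row in records:
        totals[row[dimension]] += row[measure]
    return dict(totals)

by_geography = total_by(sales, "geography")
by_product = total_by(sales, "product")
```

The same measure can be viewed by geography, by time, or by product simply by changing the dimension argument.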

Dimensional Modeling Using
Relational DBMS
Special schema design: star, snowflake
Special indexes: bitmap, multi-table join
Special tuning: maximize query throughput
Proven technology (relational model, DBMS), tends to outperform specialized MDDB, especially on large data sets
Products
– IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
Dimensional Modeling Using
Special-Purpose Model (MDDB)
Facts stored in multi-dimensional arrays
Dimensions used to index array
Sometimes on top of relational DB
Products
– Pilot, Arbor Essbase, Gentia
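
A minimal sketch of the array idea, assuming a tiny dense cube held in nested Python lists. The dimension members and the helpers `store` and `lookup` are hypothetical, and real MDDB products use their own storage formats.

```python
# Hypothetical dimension members; each list position is an array coordinate.
products = ["Milk", "Bread"]
cities = ["LA", "NY"]
days = ["M", "T", "W"]

# Map each dimension value to its position, so dimensions index the array.
p_idx = {p: i for i, p in enumerate(products)}
c_idx = {c: i for i, c in enumerate(cities)}
d_idx = {d: i for i, d in enumerate(days)}

# Dense 3-D array of facts (units sold), initialised to zero.
cube = [[[0 for _ in days] for _ in cities] for _ in products]

def store(product, city, day, units):
    """Write one fact into the cell addressed by the three dimension values."""
    cube[p_idx[product]][c_idx[city]][d_idx[day]] = units

def lookup(product, city, day):
    """Read one fact by direct array addressing (no join needed)."""
    return cube[p_idx[product]][c_idx[city]][d_idx[day]]

store("Bread", "LA", "M", 56)
store("Milk", "NY", "T", 34)
bread_la_m = lookup("Bread", "LA", "M")
```

A lookup is plain array addressing, but every combination of dimension values occupies a cell, even when no fact exists for it.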
Example
“Sales by product line over the past six months”
“Sales by account between 1990 and 1995”
[Figure: star schema for the example queries]
Fact table for measures: Prod Code, Time Code, Acct Code (key columns joining fact table to dimension tables); Sales, Qty (numerical measures)
Dimension tables: Product Info, Time Info, Account Info, ...
Dimensional Modeling

Dimensions are organized into hierarchies
– E.g., Time dimension: days -> weeks -> quarters
– E.g., Product dimension: product -> product line -> brand
Dimensions have attributes
Physical architecture described by star schema
Example Cont’d
Time: Time Code, Quarter Code, Quarter Name, Week Code, Day Code, Day Name
Account: Account Code, Key Account Code, Account Name, Account Type, Account Market
Sales (fact table): Geography Code, Time Code, Account Code, Product Code, Dollar Amount, Units
Geography: Geography Code, Region Code, Region Mgr, City Code, City Name
Product: Product Code, Product Name, Brand Mgr, Brand Code, Prod. Line Code, Prod. Line Name, Prod. Name, ...
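
A star schema query of the "sales by product line" kind can be sketched with Python's built-in `sqlite3` module. The table and column names below are simplified, the rows are invented, and the product line values ('Hardware', 'Toys') are assumptions made only for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (product_code INTEGER PRIMARY KEY,
                      product_name TEXT, prod_line_name TEXT);
CREATE TABLE time_dim (time_code INTEGER PRIMARY KEY, quarter_name TEXT);
-- Fact table: one foreign key per dimension plus the numerical measures.
CREATE TABLE sales (product_code INTEGER, time_code INTEGER,
                    dollar_amount REAL, units INTEGER);
INSERT INTO product VALUES (1, 'Widget', 'Hardware'),
                           (2, 'Gadget', 'Hardware'),
                           (3, 'Snicket', 'Toys');
INSERT INTO time_dim VALUES (10, 'Q1'), (11, 'Q2');
INSERT INTO sales VALUES (1, 10, 100.0, 4), (2, 10, 50.0, 2),
                         (3, 11, 30.0, 1), (1, 11, 20.0, 1);
""")

# Join the fact table to each dimension it needs, then aggregate the measure.
rows = con.execute("""
    SELECT p.prod_line_name, SUM(s.dollar_amount)
    FROM sales s
    JOIN product p ON p.product_code = s.product_code
    JOIN time_dim t ON t.time_code = s.time_code
    WHERE t.quarter_name IN ('Q1', 'Q2')
    GROUP BY p.prod_line_name
    ORDER BY p.prod_line_name
""").fetchall()
```

Each dimension is reached by a single join from the fact table, which is what keeps queries against the star easy to formulate.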
Dimensional Modeling Cont’d
Fact tables are fully normalized
Dimension tables are denormalized
– Repetitively stored for sake of simplicity and performance

Normalized:
Product_Code | Product_Name | Product_Color | Brand_Code
101 | Widget | Blue | XYZ
102 | Gadget | Blue | XYZ
103 | Snicket | Orange | ABC
104 | Graplit | Orange | ABC

Brand_Code | Brand_Mgr
XYZ | J. Smith
ABC | T. Jones

Denormalized:
Product_Code | Product_Name | Product_Color | Brand_Code | Brand_Mgr
101 | Widget | Blue | XYZ | J. Smith
102 | Gadget | Blue | XYZ | J. Smith
103 | Gadget | Green | XYZ | J. Smith
104 | Snicket | Orange | ABC | T. Jones
105 | Graplit | Orange | ABC | T. Jones
Extending Dimensional Modeling

Some instances when star schema is not ideal
– Denormalized schema may require too much storage
– Very large dimension tables may hurt performance
“Snowflake schema”
– Normalized dimensions
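
To make the snowflake idea concrete, the sketch below splits a denormalized product dimension into a product table and a separate brand table. The rows are illustrative and the helper `snowflake` is a hypothetical name, not a standard routine.

```python
# Illustrative denormalized rows:
# (Product_Code, Product_Name, Product_Color, Brand_Code, Brand_Mgr)
denormalized = [
    (101, "Widget", "Blue", "XYZ", "J. Smith"),
    (102, "Gadget", "Blue", "XYZ", "J. Smith"),
    (103, "Gadget", "Green", "XYZ", "J. Smith"),
    (104, "Snicket", "Orange", "ABC", "T. Jones"),
    (105, "Graplit", "Orange", "ABC", "T. Jones"),
]

def snowflake(rows):
    """Move the brand attributes out into their own (normalized) table."""
    products = [(code, name, color, brand) for code, name, color, brand, _ in rows]
    brands = {}
    for _, _, _, brand, mgr in rows:
        brands[brand] = mgr  # each brand manager is stored once
    return products, brands

products, brands = snowflake(denormalized)
```

The manager name is now stored twice instead of five times, at the price of one extra join whenever a query needs it.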
Advantages of Dimensional Modeling
Define complex, multi-dimensional data with simple model
Reduces the number of physical joins a query has to process
Allows the data warehouse to evolve with relatively low maintenance
HOWEVER! Star schema and relational DBMS are not the magic solution
– Query optimization is still problematic
Index Structures

Traditional access methods
– B-trees, hash tables, grid files, etc.

Popular in warehouses
– Inverted indexes (lists)
– Bit map indexes
– Join indexes
Inverted Index
Index for every keyword
Query:
– “Get people with age = 20 and name = ‘Fred’”
(1) Use age index and retrieve ids: r4, r18, r34, r35
(2) Use name index and retrieve ids: r18, r52
(3) Answer is intersection: r18

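
The inverted index lookup can be sketched with one dictionary per attribute, mapping each value to the set of record ids that hold it. The record ids follow the query example; the remaining names and ages are assumptions filled in so that the example runs.

```python
from collections import defaultdict

# Hypothetical records keyed by record id.
records = {
    "r4": {"name": "Joe", "age": 20},
    "r18": {"name": "Fred", "age": 20},
    "r34": {"name": "Sally", "age": 20},
    "r35": {"name": "Nancy", "age": 20},
    "r52": {"name": "Fred", "age": 30},
}

def build_inverted_index(recs, attribute):
    """Map each attribute value to the set of ids of records holding it."""
    index = defaultdict(set)
    for rid, rec in recs.items():
        index[rec[attribute]].add(rid)
    return index

age_index = build_inverted_index(records, "age")
name_index = build_inverted_index(records, "name")

# Conjunctive query: intersect the two id lists.
answer = age_index[20] & name_index["Fred"]
```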
Bit Map Index

Developed for Model 204 DBMS in 1987
Id | Name | Age
1 | Joe | 20
2 | Fred | 20
3 | Sally | 21
4 | Nancy | 20
5 | Tom | 20
6 | Pat | 25
7 | Dave | 21
Age values appearing in the index: 18, 19, 20, 21, 22, 23, 25, 26 (one bit map per value)
Bit maps shown in the figure: 1101100 and 0010000
Using Bit Maps

Query:
– “Get people with age = 20 and name = ‘Fred’”
(1) Bit map for age = 20: 1101100
(2) Bit map for name = ‘Fred’: 0100000
(3) Answer is intersection: 0100000
Good if domain cardinality is small
Bit vectors can be compressed

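
The bit map query can be sketched with Python integers standing in for bit vectors, using the seven-row Name/Age table from the bit map example. The helpers `bitmap` and `show` are hypothetical names; production systems keep compressed bit vectors instead.

```python
names = ["Joe", "Fred", "Sally", "Nancy", "Tom", "Pat", "Dave"]
ages = [20, 20, 21, 20, 20, 25, 21]

def bitmap(column, value):
    """Build an int whose bits, read left to right, flag the matching rows."""
    bits = 0
    for v in column:
        bits = (bits << 1) | (1 if v == value else 0)
    return bits

def show(bits, n):
    """Render a bit vector as an n-character string of 0s and 1s."""
    return format(bits, "0{}b".format(n))

age_20 = bitmap(ages, 20)
name_fred = bitmap(names, "Fred")
answer = age_20 & name_fred  # the intersection is one bitwise AND

# Row ids (1-based) whose bit is set in the answer.
matching_ids = [i + 1 for i, ch in enumerate(show(answer, len(ages))) if ch == "1"]
```

The AND over whole machine words is why bit maps are attractive when the attribute domain is small.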
Join Index

Index on one table for a quantity that involves a column value of a different table

Product:
Id | Name | Price | jindex
P1 | Bolt | 10 | r1,r3,r5,r6
P2 | Nut | 5 | r2,r4

Sale:
rid | Prodid | Storeid | Date | Amt
r1 | P1 | C1 | 1 | 12
r2 | P2 | C1 | 1 | 11
r3 | P1 | C3 | 1 | 50
r4 | P2 | C2 | 1 | 8
r5 | P1 | C1 | 2 | 44
r6 | P1 | C2 | 2 | 4
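
A minimal sketch of building and using the join index from the Product/Sale example. The dictionary layout and the helper `build_join_index` are assumptions made for illustration.

```python
from collections import defaultdict

product = {
    "P1": {"name": "Bolt", "price": 10},
    "P2": {"name": "Nut", "price": 5},
}
sale = {
    "r1": {"prodid": "P1", "storeid": "C1", "date": 1, "amt": 12},
    "r2": {"prodid": "P2", "storeid": "C1", "date": 1, "amt": 11},
    "r3": {"prodid": "P1", "storeid": "C3", "date": 1, "amt": 50},
    "r4": {"prodid": "P2", "storeid": "C2", "date": 1, "amt": 8},
    "r5": {"prodid": "P1", "storeid": "C1", "date": 2, "amt": 44},
    "r6": {"prodid": "P1", "storeid": "C2", "date": 2, "amt": 4},
}

def build_join_index(sale_rows):
    """For each product id, list the rids of the Sale rows that join with it."""
    jindex = defaultdict(list)
    for rid, row in sale_rows.items():
        jindex[row["prodid"]].append(rid)
    return dict(jindex)

jindex = build_join_index(sale)

# Total sales amount for Bolt (P1), following the precomputed pointers
# instead of scanning the whole Sale table.
bolt_total = sum(sale[rid]["amt"] for rid in jindex["P1"])
```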
Aggregation


Process by which low-level data is summarized in advance and placed into intermediate tables
Speeds up query processing, less ad-hoc
– “Show me total US sales for 1990”
How much to aggregate?
Data cube data model
– All possible aggregations along all dimensions
– Cells contain aggregated values
– How many of the cells in the cube should be pre-computed?
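
To show what "all possible aggregations" means, the sketch below computes every group-by of a tiny fact list, one per subset of the dimensions. The facts and the helper `data_cube` are invented for illustration; a real warehouse would pre-compute only some of these.

```python
from collections import defaultdict
from itertools import combinations

dimensions = ("product", "city", "day")
# Hypothetical facts: (product, city, day, units)
facts = [
    ("Milk", "LA", "M", 10),
    ("Milk", "NY", "M", 5),
    ("Bread", "LA", "T", 56),
]

def data_cube(rows, dims):
    """Aggregate the measure for every subset of the dimensions."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in group)
                totals[key] += row[-1]  # last field is the measure
            cube[tuple(dims[i] for i in group)] = dict(totals)
    return cube

cube = data_cube(facts, dimensions)
```

With three dimensions there are 2^3 = 8 group-bys, which is why the number of pre-computed cells has to be chosen with care.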
Aggregation Cont’d

Special operators to navigate the hierarchies
– Roll-up: remove a dimension element, e.g., roll-up products to brands
– Drill-down (opposite of roll-up)
– Slice (defines a subcube)
– Various visualization ops (e.g., pivot)
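
Roll-up and slice can be sketched on a small cube held as a dictionary of cells. The cell values, the city-to-region mapping, and both helper functions are assumptions chosen for this example.

```python
from collections import defaultdict

# Hypothetical cells: (product, city, day) -> units
cells = {
    ("Milk", "LA", "M"): 34,
    ("Bread", "LA", "M"): 56,
    ("Milk", "SF", "M"): 10,
    ("Bread", "NY", "T"): 12,
}
# Hypothetical hierarchy level: city -> region
region_of = {"LA": "West", "SF": "West", "NY": "East"}

def roll_up_city_to_region(cube):
    """Roll-up: replace each city by its region and sum the merged cells."""
    out = defaultdict(int)
    for (product, city, day), units in cube.items():
        out[(product, region_of[city], day)] += units
    return dict(out)

def slice_cube(cube, day):
    """Slice: fix one dimension value and keep the remaining subcube."""
    return {(p, c): u for (p, c, d), u in cube.items() if d == day}

by_region = roll_up_city_to_region(cells)
monday = slice_cube(cells, "M")
```

Drill-down is the reverse direction and needs the finer-grained cells to still be available.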
Example
[Figure: data cube over Time (M T W Th F S S), Product (Juice, Milk, Coke, Cream, Soap, Bread), and Geography (NY, SF, LA), with arrows for roll-up to week, roll-up to brand, and roll-up to region. Cell values shown: 10, 34, 56, 32, 12, 56. Example cell: 56 units of bread sold in LA on M.]
Dimensions: Time, Product, Geography
Attributes: Product (upc, price, …), Geography …, …
Hierarchies:
– Product -> Brand -> …
– Day -> Week -> Quarter
– City -> Region -> Country
Warehouse DBMS—Buzzwords

Used primarily for decision support (DSS)
– A.K.A. On-Line Analytical Processing (OLAP)
– Complex queries, substantial aggregation
– TPC-D benchmark

Multidimensional data model
– Can be implemented using either the relational model or a proprietary data model
– Multi-dimensional database (MDDB)

Aggregation: Data Cube
– All possible groupings and aggregations
Warehouse DBMS — Buzzwords (2)
ROLAP vs. MOLAP
Special-purpose OLAP servers that directly implement multidimensional data and operations
– Roll-up = aggregate on some dimension
– Drill-down = deaggregate on some dimension
ROLAP: Oracle, Sybase IQ, RedBrick
MOLAP: Pilot, Essbase, Gentia

Warehouse DBMS - Buzzwords (3)

Clients:
– Query and reporting tools
– Analysis tools
– Data mining: discovering patterns of various forms

Poses many new research issues in:
– Query processing and optimization
– Database design
– View management
Data Warehouse Physical Design
Common Design Activities – OLTP

Schema design (base tables)
– Normalization (3NF, BCNF, …)

Schema design (views)
– Mostly for convenience, security
– Usually NOT for performance
– Exception: View indexing [Roussopoulos 1982]
  (materialize pointers to tuples instead of tuples themselves)
Index selection
– In practice, use rules of thumb
– Tool: DBDSGN [IBM Almaden], RDT for System R
Relational Views
Part of the ANSI/SPARC architecture
Derived, virtual table
View definition is an SQL query statement
View update problem
Good for logical data independence, security
How to implement a view for querying
– Query modification: modify view query into a query on the underlying base tables
– View materialization: physically implementing view as table
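
The two implementation choices can be contrasted with Python's built-in `sqlite3` module. SQLite has no materialized views, so `CREATE TABLE ... AS SELECT` stands in for materialization here; the table, the view names, and the rows are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_no INTEGER, supp_no INTEGER, qty INTEGER);
INSERT INTO orders VALUES (1, 15, 10), (2, 15, 20), (3, 7, 5);
-- Virtual view: only the definition is stored; queries on it are
-- rewritten into queries on the base table.
CREATE VIEW supp_totals AS
    SELECT supp_no, SUM(qty) AS total_qty FROM orders GROUP BY supp_no;
-- Stand-in for a materialized view: the result is stored as a table.
CREATE TABLE supp_totals_mat AS SELECT * FROM supp_totals;
""")

def total_for(table, supp_no):
    """Read one supplier total; `table` is one of the two fixed names above."""
    return con.execute(
        "SELECT total_qty FROM {} WHERE supp_no = ?".format(table), (supp_no,)
    ).fetchone()[0]

before = (total_for("supp_totals", 15), total_for("supp_totals_mat", 15))
con.execute("INSERT INTO orders VALUES (4, 15, 1)")
after = (total_for("supp_totals", 15), total_for("supp_totals_mat", 15))
```

After the insert the virtual view reflects the new row while the stored copy is stale until it is refreshed, which is the maintenance cost that materialization brings.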
View Indexing ...
In general, no need for materialized views in OLTP systems
– Increase in performance through indexing
– Secondary storage space used to be expensive
New idea (N. Roussopoulos 1982) - view index
Store index whose elements point to tuples which comprise view
View selection problem: Find a subset of views which, when indexed, minimizes the total cost of answering all queries as well as the cost of maintaining the view structures

… View Indexing …
Assume N views to consider, 2^N subsets
Can’t do simple enumeration (cost to answer all queries in a given subset)
– NP-complete problem
Solution uses search algorithm to approximate the optimal view selection
– Potential exponential worst case
– Only subset of views needs to be considered
Cost function which computes for each state (set of views + remaining storage)
(1) Cost to compute queries, maintenance of current index set +
(2) Estimate of incremental cost that must be incurred in extending view set (upper bound on actual cost)
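
To see why simple enumeration does not scale, the sketch below brute-forces all 2^N subsets for N = 3 under a toy cost model. This is plain enumeration, not the search algorithm referred to above, and every cost figure and name in it is a made-up assumption.

```python
from itertools import combinations

# Hypothetical numbers, purely for illustration.
# query -> (cost with no view, {view: cost if that view is indexed})
query_costs = {
    "q1": (100, {"v1": 10}),
    "q2": (80, {"v1": 40, "v2": 5}),
    "q3": (60, {"v3": 20}),
}
maintenance = {"v1": 30, "v2": 50, "v3": 45}

def total_cost(chosen):
    """Maintenance of the chosen views plus the cheapest way to run each query."""
    cost = sum(maintenance[v] for v in chosen)
    for base, with_view in query_costs.values():
        cost += min([base] + [c for v, c in with_view.items() if v in chosen])
    return cost

def best_subset(views):
    """Enumerate every subset of views (2^N of them) and keep the cheapest."""
    subsets = [set(c) for k in range(len(views) + 1)
               for c in combinations(views, k)]
    return min(subsets, key=total_cost), len(subsets)

best, examined = best_subset(sorted(maintenance))
```

Three candidate views already mean 8 subsets to cost; each additional view doubles that count, which motivates a guided search.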
… View Indexing
But ...
Algorithm does not consider index selection on views (view indexes)
– Indexes have impact on which view indexes to choose
Very simple cost model (maintenance cost ~ size of view)
– Problem: Cost of maintaining view is a complex query optimization problem
– Cannot be estimated without knowing which subview indexes are chosen
Good first treatment of subject
Indexing ...

Which type of index structure, which attribute(s) to index on
– “Access path selection” -> DBA
– Many choices, depend on many factors
– Space-time trade-off
Index selection problem: Which ordering rule for stored records and which non-clustered indices
Database practitioners use rules/guidelines (e.g., SYBASE manual)
Design tools available
– Support DBA during creation and maintenance of database, i.e., solve the index-selection problem
Factors that Influence Index Selection
Maintenance
Storage cost
Global solution depends on index selection of all tables combined

Example
ORDERS (OrderNo, SuppNo, PartNo, Date, Qty)
PARTS (PartNo, Descrip, SuppNo, QtyOnHand, Color, …)
Query:
SELECT O.SuppNo
FROM PARTS P, ORDERS O
WHERE O.PartNo = P.PartNo AND O.SuppNo = 15
AND P.QtyOnHand BETWEEN 100 AND 150
Situation 1: Assume PARTS clustered on Descrip and non-clustered index on PartNo
Then: Best clustered index for ORDERS is on SuppNo
Situation 2: Assume PARTS clustered on PartNo
Then: Best clustered index for ORDERS is on PartNo
Data Warehouse Design

Schema design (base tables)
– Star schema (dimensions, measures)

Schema design (view/index selection)
– Mostly for performance enhancement

Physical warehouse design. Balance three costs:
(1) The cost of answering queries using warehouse
relations and additional structures
(2) Cost of maintaining additional structures
(3) Cost of secondary storage
WH Schema Design




Tables must map efficiently to the operational requests
OLTP: maximize concurrency, optimize insert/update/delete performance
OLAP: Queries large, complex, ad-hoc, data-intensive, no updates
Query-centric view -> Star schema (facts, dimensions)
– Widely accepted, intuitive, easy to navigate (query formulation)
Problem: Poor performance on OLTP db engines
– Join processing (pair-wise join problem)
– Number of pair-wise join orders for N tables = N!
– e.g., 7 tables -> 5,040 combinations; with 5 different join algorithms -> 25,200 combinations
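
The arithmetic behind these figures, following the counting used above (N! join orders, multiplied by the number of join algorithms). The function name `join_plans` is hypothetical.

```python
from math import factorial

def join_plans(num_tables, num_algorithms=1):
    """Join orders (N!) times the number of available join algorithms."""
    return factorial(num_tables) * num_algorithms

orders_7 = join_plans(7)    # 7! join orders
plans_7 = join_plans(7, 5)  # the same orders with 5 join algorithms
```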
Star Schema Join Problem
Heuristic: “pick directly related tables” doesn’t work in star schema
Options:
– Join unrelated tables (Cartesian product)
– Parallelism (speed-up, scale-up)
– New join techniques (e.g., bit vector star joins) in combination with new indexing schemes (e.g., bit maps, variant indexes)
Warehouse Access Path (Physical)
Design Problem
Materialize user queries as views (reduces cost 1)
How to reduce cost 2 and 3?
“View Index Selection Problem VIS”
– Choose a set of supporting views and a set of indexes to materialize such that the total maintenance cost for the warehouse is minimized (cost 2 & 3)
Solutions - Relational DB Design
Practices

Relational DB design algorithms must be adapted
– View index approach has no index selection, simple cost model (cannot achieve global solution by locally optimizing each materialized subview)
– Index selection approach can be extended - but trouble ahead
Algorithms require queries and frequencies as input
Solutions - Rule Condition Maintenance

Work on rule condition evaluation
– How to evaluate trigger conditions for rules efficiently (~ view maintenance problem: rule is triggered whenever view that satisfies its condition becomes non-empty)
– Discrimination networks for each rule (view)
  RETE model materializes selection and join nodes
  TREAT materializes only selection nodes
– Incremental evaluation techniques
Recommendations not generally applicable