DW Notes

advertisement
Data Warehousing
CPS216
Notes 13
Shivnath Babu
Warehousing
Growing industry: $8 billion way back in
1998
 Range from desktop to huge:

 Walmart:
900-CPU, 2,700 disk, 23TB
Teradata system

Lots of buzzwords, hype
 slice
& dice, rollup, MOLAP, pivot, ...
2
Outline
What is a data warehouse?
 Why a warehouse?
 Models & operations
 Implementing a warehouse
 Future directions

3
What is a Warehouse?

Collection of diverse data
 subject
oriented
 aimed at executive, decision maker
 often a copy of operational data
 with value-added data (e.g., summaries, history)
 integrated
 time-varying
 non-volatile
more
4
What is a Warehouse?

Collection of tools
 gathering
data
 cleansing, integrating, ...
 querying, reporting, analysis
 data mining
 monitoring, administering warehouse
5
Warehouse Architecture
Client
Client
Query & Analysis
Metadata
Warehouse
Integration
Source
Source
Source
6
Motivating Examples
Forecasting
 Comparing performance of units
 Monitoring, detecting fraud
 Visualization

7
Why a Warehouse?

Two Approaches:
 Query-Driven
(Lazy)
 Warehouse (Eager)
?
Source
Source
8
Query-Driven Approach
Client
Client
Mediator
Wrapper
Source
Wrapper
Wrapper
Source
Source
9
Advantages of Warehousing
High query performance
 Queries not visible outside warehouse
 Local processing at sources unaffected
 Can operate when sources unavailable
 Can query data not stored in a DBMS
 Extra information at warehouse

 Modify,
summarize (store aggregates)
 Add historical information
10
Advantages of Query-Driven

No need to copy data
 less
storage
 no need to purchase data
More up-to-date data
 Query needs can be unknown
 Only query interface needed at sources
 May be less draining on sources

11
OLTP vs. OLAP

OLTP: On Line Transaction Processing
 Describes

processing at operational sites
OLAP: On Line Analytical Processing
 Describes
processing at warehouse
12
OLTP vs. OLAP
OLTP







Mostly updates
Many small transactions
Mb-Gb of data
Raw data
Clerical users
Up-to-date data
Consistency,
recoverability critical
OLAP





Mostly reads
Queries long, complex
Gb-Tb of data
Summarized,
consolidated data
Decision-makers,
analysts as users
13
Data Marts
Smaller warehouses
 Spans part of organization

 e.g.,

marketing (customers, products, sales)
Do not require enterprise-wide consensus
 but
long term integration problems?
14
Warehouse Models & Operators

Data Models
 relations
 stars
& snowflakes
 cubes

Operators
 slice
& dice
 roll-up, drill down
 pivoting
 other
15
Star
product
prodId
p1
p2
name price
bolt
10
nut
5
sale oderId date
o100 1/7/97
o102 2/7/97
105 3/8/97
customer
custId
53
81
111
store storeId
c1
c2
c3
custId
53
53
111
name
joe
fred
sally
prodId
p1
p2
p1
storeId
c1
c1
c3
address
10 main
12 main
80 willow
qty
1
2
5
city
nyc
sfo
la
amt
12
11
50
city
sfo
sfo
la
16
Star Schema
product
prodId
name
price
sale
orderId
date
custId
prodId
storeId
qty
amt
customer
custId
name
address
city
store
storeId
city
17
Terms
Fact table
 Dimension tables
 Measures

product
prodId
name
price
sale
orderId
date
custId
prodId
storeId
qty
amt
customer
custId
name
address
city
store
storeId
city
18
Dimension Hierarchies
sType
store
store storeId
s5
s7
s9
city
cityId
sfo
sfo
la
tId
t1
t2
t1
mgr
joe
fred
nancy
 snowflake schema
 constellations
region
sType tId
t1
t2
size
small
large
city cityId pop
sfo
1M
la
5M
location
downtown
suburbs
regId
north
south
region regId
name
north cold region
south warm region
19
Cube
Fact table view:
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
Multi-dimensional cube:
amt
12
11
50
8
p1
p2
c1
12
11
c2
c3
50
8
dimensions = 2
20
3-D Cube
Fact table view:
sale
prodId
p1
p2
p1
p2
p1
p1
storeId
c1
c1
c3
c2
c1
c2
Multi-dimensional cube:
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
day 2
day 1
p1
p2 c1
p1
12
p2
11
c1
44
c2
4
c2
c3
c3
50
8
dimensions = 3
21
ROLAP vs. MOLAP
ROLAP:
Relational On-Line Analytical Processing
 MOLAP:
Multi-Dimensional On-Line Analytical
Processing

22
Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
81
23
Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
ans
date
1
2
sum
81
48
24
Another Example
• Add up amounts by day, product
• In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date, prodId
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
sale
prodId
p1
p2
p1
date
1
1
2
amt
62
19
48
rollup
drill-down
25
Aggregates
Operators: sum, count, max, min,
median, ave
 “Having” clause
 Using dimension hierarchy

 average
by region (within store)
 maximum by month (within date)
26
Cube Aggregation
day 2
day 1
p1
p2 c1
p1
12
p2
11
p1
p2
c1
56
11
c1
44
c2
4
c2
c3
Example: computing sums
...
c3
50
8
c2
4
8
rollup
drill-down
c3
50
sum
c1
67
c2
12
c3
50
129
p1
p2
sum
110
19
27
Cube Operators
day 2
day 1
p1
p2 c1
p1
12
p2
11
p1
p2
c1
56
11
c1
44
c2
4
c2
c3
...
c3
50
sale(c1,*,*)
8
c2
4
8
c3
50
sale(c2,p2,*)
sum
c1
67
c2
12
c3
50
129
p1
p2
sum
110
19
sale(*,*,*)
28
Extended Cube
c2
4
8
c312
p1
p2
c1
*
12
p1
p2
c1*
44
c1
56
11
c267
4
c2
44
c3
4
50
11
23
8
8
50
*
62
19
81
*
day 2
day 1
p1
p2
*
c3
50
* 50
48
48
*
110
19
129
sale(*,p2,*)
29
Aggregation Using Hierarchies
day 2
day 1
p1
p2 c1
p1
12
p2
11
c1
44
c2
4
c2
c3
c3
50
customer
region
8
country
p1
p2
region A region B
56
54
11
8
(customer c1 in Region A;
customers c2, c3 in Region B)
30
Pivoting
Fact table view:
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
Multi-dimensional cube:
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
Pivot turns unique values from
one column into unique columns
in the output
day 2
day 1
p1
p2 c1
p1
12
p2
11
p1
p2
c1
56
11
c1
44
c2
4
c2
c3
c3
50
8
c2
4
8
c3
50
31
Derived Data

Derived Warehouse Data
 indexes
 aggregates
 materialized
views (next slide)
When to update derived data?
 Incremental vs. refresh

32
Materialized Views

sale
Define new warehouse relations using
SQL expressions
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
joinTb
date
1
1
1
1
2
2
prodId
p1
p2
p1
p2
p1
p1
amt
12
11
50
8
44
4
name
bolt
nut
bolt
nut
bolt
bolt
product
price
10
5
10
5
10
10
storeId
c1
c1
c3
c2
c1
c2
date
1
1
1
1
2
2
id
p1
p2
amt
12
11
50
8
44
4
name price
bolt
10
nut
5
does not exist
at any source
33
Processing
ROLAP servers vs. MOLAP servers
 Index Structures
 What to Materialize?
 Algorithms

Client
Client
Query & Analysis
Metadata
Warehouse
Integration
Source
Source
Source
34
ROLAP Server

Relational OLAP Server
sale
prodId
p1
p2
p1
date
1
1
2
sum
62
19
48
tools
utilities
ROLAP
server
Special indices, tuning;
Schema is “denormalized”
relational
DBMS
35
MOLAP Server
Multi-Dimensional OLAP Server
Sales
B
A
M.D. tools
Product

milk
soda
eggs
soap
1
utilities
multidimensional
server
2 3 4
Date
could also
sit on
relational
DBMS
36
Index Structures

Traditional Access Methods
 B-trees,

hash tables, R-trees, grids, …
Popular in Warehouses
 inverted
lists
 bit map indexes
 join indexes
 text indexes
37
Inverted Lists
18
19
20
21
22
23
25
26
age
index
r5
r19
r37
r40
rId
r4
r18
r19
r34
r35
r36
r5
r41
name age
joe
20
fred
20
sally
21
nancy 20
tom
20
pat
25
dave
21
jeff
26
...
20
23
r4
r18
r34
r35
inverted
lists
data
records
38
Using Inverted Lists

Query:
 Get
people with age = 20 and name = “fred”
List for age = 20: r4, r18, r34, r35
 List for name = “fred”: r18, r52
 Answer is intersection: r18

39
Bit Maps
20
23
20
21
22
1
1
0
1
1
0
0
0
0
23
25
26
age
index
bit
maps
0
0
1
0
0
0
1
0
1
1
id
1
2
3
4
5
6
7
8
name age
joe
20
fred
20
sally
21
nancy 20
tom
20
pat
25
dave
21
jeff
26
...
18
19
data
records
40
Using Bit Maps

Query:
 Get
people with age = 20 and name = “fred”
List for age = 20: 1101100000
 List for name = “fred”: 0100000001
 Answer is intersection: 010000000000

Good if domain cardinality small
 Bit vectors can be compressed

41
Join
• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE ...
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
joinTb
date
1
1
1
1
2
2
prodId
p1
p2
p1
p2
p1
p1
amt
12
11
50
8
44
4
name
bolt
nut
bolt
nut
bolt
bolt
product
price
10
5
10
5
10
10
storeId
c1
c1
c3
c2
c1
c2
date
1
1
1
1
2
2
id
p1
p2
name price
bolt
10
nut
5
amt
12
11
50
8
44
4
42
Join Indexes
join index
product
sale
id
p1
p2
rId
r1
r2
r3
r4
r5
r6
name price
bolt
10
nut
5
jIndex
r1,r3,r5,r6
r2,r4
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
43
What to Materialize?
Store in warehouse results useful for
common queries
 Example:
total sales

day 2
day 1
c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8
p1
p2
materialize
c1
56
11
c2
4
8
c3
50
...
p1
c1
67
c2
12
c3
50
129
p1
p2
c1
110
19
44
Materialization Factors
Type/frequency of queries
 Query response time
 Storage cost
 Update cost

45
Cube Aggregates Lattice
129
all
c1
67
p1
c2
12
c3
50
city
city, product
p1
p2
c1
56
11
c2
4
8
product
city, date
date
product, date
c3
50
day 2
day 1
c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8
city, product, date
use greedy
algorithm to
decide what
to materialize
46
Dimension Hierarchies
all
cities
state
city
c1
c2
state
CA
NY
city
47
Dimension Hierarchies
all
city
city, product
product
city, date
city, product, date
date
product, date
state
state, date
state, product
state, product, date
not all arcs shown...
48
Interesting Hierarchy
time
all
years
weeks
quarters
months
day
1
2
3
4
5
6
7
8
week
1
1
1
1
1
1
1
2
month
1
1
1
1
1
1
1
1
quarter
1
1
1
1
1
1
1
1
year
2000
2000
2000
2000
2000
2000
2000
2000
conceptual
dimension table
days
49
Download