Notes 7

Cube Computation and
Indexes for Data Warehouses
CPS 196.03
Processing
ROLAP servers vs. MOLAP servers
 Index Structures
 Cube computation
 What to Materialize?
 Algorithms

[Figure: warehouse architecture. Sources feed an Integration layer that loads the Warehouse (with Metadata); a Query & Analysis layer sits between the Warehouse and the Clients.]
ROLAP Server

Relational OLAP Server
 Special indices, tuning; schema is “denormalized”
[Figure: tools and utilities sit on a ROLAP server, which runs on a relational DBMS. Example table:]

sale
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48
MOLAP Server
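Multi-Dimensional OLAP Server
[Figure: M.D. tools and utilities sit on a multidimensional server, which could also sit on a relational DBMS. The Sales data is drawn as a cube with axes Product (milk, soda, eggs, soap) and Date (1, 2, 3, 4), plus further labels A and B.]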
MOLAP
[Figure: 3-D sales cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in U.S.A.]
MOLAP
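[Figure: 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64. Chunks 1 to 4 run along A at b0, c0; chunks 13 to 16 at b3, c0; chunks 29 to 32, 45 to 48, and 61 to 64 at b3 for c1, c2, and c3.]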
Challenges in MOLAP

Storing large arrays for efficient access
 Row-major, column-major
 Chunking
 Compressing sparse arrays
Creating array data from data in tables
 Efficient techniques for cube computation

Topics are discussed in the paper for reading
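As a rough illustration of the storage choices listed above, the sketch below compares row-major addressing with a chunked layout and shows one simple way to compress a sparse array. The function names, the 16x16x16 array, and the 4x4x4 chunk shape are assumptions made for this example; they are not taken from the paper for reading.

```python
def row_major_offset(index, shape):
    """Linear offset of a cell when the last dimension varies fastest."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


def chunked_offset(index, shape, chunk):
    """Linear offset when the array is split into equal-sized chunks.

    Assumes every dimension is an exact multiple of the chunk size.
    Chunks are stored in row-major order, and the cells inside each
    chunk are stored in row-major order as well.
    """
    chunks_per_dim = tuple(n // c for n, c in zip(shape, chunk))
    chunk_id = row_major_offset(
        tuple(i // c for i, c in zip(index, chunk)), chunks_per_dim
    )
    within = row_major_offset(tuple(i % c for i, c in zip(index, chunk)), chunk)
    cells_per_chunk = 1
    for c in chunk:
        cells_per_chunk *= c
    return chunk_id * cells_per_chunk + within


def compress_sparse(cells):
    """Keep only the non-empty cells as (offset, value) pairs."""
    return [(off, v) for off, v in enumerate(cells) if v is not None]


shape, chunk = (16, 16, 16), (4, 4, 4)
print(row_major_offset((1, 2, 3), shape))       # 291
print(chunked_offset((1, 2, 3), shape, chunk))  # 27
print(compress_sparse([None, 5, None, 7]))      # [(1, 5), (3, 7)]
```

With chunking, cells that are close in all three dimensions land in the same chunk, so a query over a small sub-cube touches fewer storage blocks than it would under a plain row-major layout.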
Index Structures

Traditional Access Methods
 B-trees, hash tables, R-trees, grids, …
Popular in Warehouses
 inverted lists
 bit map indexes
 join indexes
 text indexes
Inverted Lists
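[Figure: an age index (entries 18, 19, 20, 21, 22, 23, 25, 26) points to inverted lists, which in turn point to the data records.]

Inverted list for age = 20: r4, r18, r34, r35
A second list shown: r5, r19, r37, r40

Data records
rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26
...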
Using Inverted Lists

Query:
 Get people with age = 20 and name = “fred”
 List for age = 20: r4, r18, r34, r35
 List for name = “fred”: r18, r52
 Answer is intersection: r18
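The query above can be sketched as a merge of two sorted lists. This is a minimal sketch, assuming record ids are stored as ascending integers (r18 becomes 18); the dictionaries `age_index` and `name_index` and the function name are hypothetical and only hold the two lists from the slide.

```python
def intersect_sorted(list_a, list_b):
    """Merge-intersect two ascending lists of record numbers."""
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result


# Inverted lists from the slide, keyed by attribute value.
age_index = {20: [4, 18, 34, 35]}
name_index = {"fred": [18, 52]}

matches = intersect_sorted(age_index[20], name_index["fred"])
print(["r%d" % r for r in matches])  # ['r18']
```

Because both lists are sorted, the intersection needs a single pass over each list and never touches the data records themselves.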
Bit Maps
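[Figure: an age index (entries 18, 19, 20, 21, 22, 23, 25, 26) points to bit maps that hold one bit per data record.]

Bit map for age = 20: 1 1 0 1 1 0 0 0 0 ...
A second bit map shown: 0 0 1 0 0 0 1 0 1 1

Data records
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26
...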
Bitmap Index
 Index on a particular column
 Each value in the column has a bit vector: bit-op is fast
 The length of the bit vector: # of records in the base table
 The i-th bit is set if the i-th row of the base table has the value for the indexed column
 Not suitable for high-cardinality domains

Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
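The Region and Type indexes above can be derived mechanically from the base table. The following is a small sketch under the assumption that each row is a dictionary and each bit vector is a plain list of 0/1 values; the function name `build_bitmap_index` is made up for this example.

```python
def build_bitmap_index(rows, column):
    """Return {value: bit list} with one bit per row of the base table."""
    index = {}
    for pos, row in enumerate(rows):
        value = row[column]
        if value not in index:
            index[value] = [0] * len(rows)
        index[value][pos] = 1
    return index


base_table = [
    {"Cust": "C1", "Region": "Asia", "Type": "Retail"},
    {"Cust": "C2", "Region": "Europe", "Type": "Dealer"},
    {"Cust": "C3", "Region": "Asia", "Type": "Dealer"},
    {"Cust": "C4", "Region": "America", "Type": "Retail"},
    {"Cust": "C5", "Region": "Europe", "Type": "Dealer"},
]

region_index = build_bitmap_index(base_table, "Region")
type_index = build_bitmap_index(base_table, "Type")

# Dealers in Europe: AND the two bit vectors position by position.
hits = [r & t for r, t in zip(region_index["Europe"], type_index["Dealer"])]
print(hits)  # [0, 1, 0, 0, 1], i.e. customers C2 and C5
```

Each distinct value costs one bit per base-table row, which is why the structure stops being attractive once the column has many distinct values.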
Using Bit Maps

Query:
 Get people with age = 20 and name = “fred”
 List for age = 20: 1101100000
 List for name = “fred”: 0100000001
 Answer is intersection: 0100000000

Good if domain cardinality small
 Bit vectors can be compressed
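To make the query above concrete, here is a short sketch that ANDs the two bit maps from the slide and then run-length encodes the result. Storing bit maps as strings and using run-length encoding are assumptions chosen for readability; the slide does not say which representation or compression scheme is meant.

```python
from itertools import groupby


def bitmap_and(a, b):
    """Bitwise AND of two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("bit maps must have the same length")
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))


def matching_ids(bits):
    """1-based record ids whose bit is set."""
    return [i for i, bit in enumerate(bits, start=1) if bit == "1"]


def run_length_encode(bits):
    """Compress a bit string into (bit, run length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


age_20 = "1101100000"
name_fred = "0100000001"

answer = bitmap_and(age_20, name_fred)
print(answer)                     # 0100000000
print(matching_ids(answer))       # [2]
print(run_length_encode(answer))  # [('0', 1), ('1', 1), ('0', 8)]
```

The answer has the same length as its inputs, and only record 2 (fred, age 20) has its bit set in both.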
Join
• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE ...
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
Join Indexes
[Figure: a join index attached to product lists, for each product, the sale records that join with it.]

product
id  name  price  jIndex
p1  bolt  10     r1, r3, r5, r6
p2  nut   5      r2, r4

sale
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4
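A join index like the one above can be built in one pass over the sale records. The sketch below assumes both tables fit in memory as dictionaries; `build_join_index` and `join_via_index` are hypothetical helper names, not part of any system described in the notes.

```python
product = {
    "p1": {"name": "bolt", "price": 10},
    "p2": {"name": "nut", "price": 5},
}
sale = {
    "r1": {"prodId": "p1", "storeId": "c1", "date": 1, "amt": 12},
    "r2": {"prodId": "p2", "storeId": "c1", "date": 1, "amt": 11},
    "r3": {"prodId": "p1", "storeId": "c3", "date": 1, "amt": 50},
    "r4": {"prodId": "p2", "storeId": "c2", "date": 1, "amt": 8},
    "r5": {"prodId": "p1", "storeId": "c1", "date": 2, "amt": 44},
    "r6": {"prodId": "p1", "storeId": "c2", "date": 2, "amt": 4},
}


def build_join_index(sale_rows):
    """Map each product id to the rIds of the sale records that reference it."""
    index = {}
    for rid, row in sale_rows.items():
        index.setdefault(row["prodId"], []).append(rid)
    return index


def join_via_index(product_rows, sale_rows, jindex, prod_id):
    """Produce the joined rows for one product by following the join index."""
    prod = product_rows[prod_id]
    return [{**sale_rows[rid], **prod} for rid in jindex.get(prod_id, [])]


jindex = build_join_index(sale)
print(jindex)  # {'p1': ['r1', 'r3', 'r5', 'r6'], 'p2': ['r2', 'r4']}
print(sum(row["amt"] for row in join_via_index(product, sale, jindex, "p2")))  # 19
```

Once the index exists, finding all sales of a product is a lookup followed by direct record fetches, with no scan of the sale table.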
Cube Computation for Data Warehouses
Counting Exercise

How many cuboids are there in a cube?
 The full or nothing case
 When dimension hierarchies are present

What is the size of each cuboid?
Lattice of Cuboids
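[Figure: lattice of cuboids over (city, product, date), from the apex "all" down to the base cuboid.]

all
 city; product; date
 city, product; city, date; product, date
 city, product, date

all: 129

city
c1  67
c2  12
c3  50

city, product
     c1  c2  c3
p1   56   4  50
p2   11   8

city, product, date
day 1
     c1  c2  c3
p1   12      50
p2   11   8
day 2
     c1  c2  c3
p1   44   4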
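The numbers in the lattice can be reproduced by summing the base cuboid over every subset of the dimensions. This is a naive sketch that recomputes each cuboid directly from the base data; the names `base`, `roll_up`, and `cube` are assumptions for this example, and an efficient algorithm would instead derive coarser cuboids from finer ones.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("city", "product", "date")

# Base cuboid from the slide: (city, product, date) -> total sales.
base = {
    ("c1", "p1", 1): 12,
    ("c1", "p2", 1): 11,
    ("c3", "p1", 1): 50,
    ("c2", "p2", 1): 8,
    ("c1", "p1", 2): 44,
    ("c2", "p1", 2): 4,
}


def roll_up(base_cells, keep):
    """Sum the base cells over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    out = defaultdict(int)
    for key, amt in base_cells.items():
        out[tuple(key[p] for p in positions)] += amt
    return dict(out)


# One cuboid per subset of the dimensions: 2**3 = 8 cuboids in total.
cube = {
    dims: roll_up(base, dims)
    for k in range(len(DIMENSIONS) + 1)
    for dims in combinations(DIMENSIONS, k)
}

print(len(cube))                                  # 8
print(cube[()][()])                               # 129
print(cube[("city",)][("c1",)])                   # 67
print(cube[("city", "product")][("c1", "p1")])    # 56
```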
Dimension Hierarchies
[Figure: hierarchy for the city dimension: all > state > city.]

cities
city  state
c1    CA
c2    NY
Dimension Hierarchies
[Figure: lattice of cuboids extended with the state level. Nodes: all; city; state; product; date; city, product; city, date; product, date; state, product; state, date; city, product, date; state, product, date. Not all arcs shown.]
Efficient Data Cube Computation

Data cube can be viewed as a lattice of cuboids
 The bottom-most cuboid is the base cuboid
 The top-most cuboid (apex) contains only one cell
 How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?

   T = \prod_{i=1}^{n} (L_i + 1)

Materialization of data cube
 Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
 Selection of which cuboids to materialize
  Based on size, sharing, access frequency, etc.
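The cuboid count is easy to check numerically. In this sketch, `count_cuboids` is a hypothetical helper that takes the number of hierarchy levels of each dimension; the extra 1 per dimension accounts for aggregating that dimension away entirely ("all").

```python
from math import prod


def count_cuboids(levels_per_dimension):
    """T = product over all dimensions of (L_i + 1)."""
    return prod(levels + 1 for levels in levels_per_dimension)


# city, product, date with one level each (no hierarchies): 2**3 cuboids.
print(count_cuboids([1, 1, 1]))  # 8

# The city dimension gains a second level (city, state).
print(count_cuboids([2, 1, 1]))  # 12
```

The second result matches the twelve nodes of the lattice that includes the state level.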
Derived Data

Derived Warehouse Data
 indexes
 aggregates
 materialized views (next slide)
When to update derived data?
 Incremental vs. refresh
Idea of Materialized Views

Define new warehouse tables/arrays
[Figure: the source tables sale and product are joined into a new table, joinTb, which does not exist at any source.]

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb (does not exist at any source)
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
Efficient OLAP Processing

Determine which operations should be performed on available cuboids

Transform drill, roll, etc. into corresponding SQL and/or OLAP operations,
e.g., dice = selection + projection

Determine which materialized cuboid(s) should be selected for OLAP:

Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?

Explore indexing structures & compressed vs. dense arrays in MOLAP
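For the cuboid-selection question above, a first step is to rule out cuboids that cannot answer the query at all. The sketch below assumes the concept hierarchies item_name < brand and city < province_or_state < country, which the slide does not state; `HIERARCHIES`, `level_of`, and `can_answer` are names invented for this example. It only tests whether a cuboid is usable, and choosing among the usable ones would still depend on their sizes and available indexes.

```python
# Assumed concept hierarchies, finest level first (not stated on the slide).
HIERARCHIES = {
    "item": ["item_name", "brand"],
    "location": ["city", "province_or_state", "country"],
    "time": ["year"],
}


def level_of(attr):
    """Return (dimension, level index) for an attribute name."""
    for dim, levels in HIERARCHIES.items():
        if attr in levels:
            return dim, levels.index(attr)
    raise KeyError(attr)


def can_answer(cuboid_attrs, cuboid_filter, query_attrs, query_filter):
    """True if the query can be derived from the cuboid by roll-up and selection."""
    available = dict(level_of(a) for a in cuboid_attrs)  # dimension -> level
    for attr in query_attrs:
        dim, lvl = level_of(attr)
        # The cuboid must hold this dimension at the same or a finer level.
        if dim not in available or available[dim] > lvl:
            return False
    for attr, value in query_filter.items():
        if cuboid_filter.get(attr) == value:
            continue  # the cuboid is already restricted to exactly this condition
        dim, lvl = level_of(attr)
        if dim not in available or available[dim] > lvl:
            return False
    # A cuboid restricted by a condition the query lacks is missing needed rows.
    return all(query_filter.get(a) == v for a, v in cuboid_filter.items())


cuboids = {
    1: (["year", "item_name", "city"], {}),
    2: (["year", "brand", "country"], {}),
    3: (["year", "brand", "province_or_state"], {}),
    4: (["item_name", "province_or_state"], {"year": 2004}),
}
usable = [
    n for n, (attrs, flt) in cuboids.items()
    if can_answer(attrs, flt, ["brand", "province_or_state"], {"year": 2004})
]
print(usable)  # [1, 3, 4]
```

Cuboid 2 drops out because country is coarser than province_or_state, so the finer grouping cannot be recovered from it.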
What to Materialize?
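Store in warehouse results useful for common queries
 Example: total sales

[Figure: aggregates that can be materialized from the base data.]

Base data
day 1
     c1  c2  c3
p1   12      50
p2   11   8
day 2
     c1  c2  c3
p1   44   4

Materialized: total sales by city and product
     c1  c2  c3
p1   56   4  50
p2   11   8

Materialized: total sales by city
c1  67
c2  12
c3  50

Materialized: total sales by product
p1  110
p2  19

Materialized: total sales overall
129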
Materialization Factors
 Type/frequency of queries
 Query response time
 Storage cost
 Update cost

Will study a concrete algorithm later
Iceberg Cube

Computing only the cuboid cells whose count or other aggregate satisfies a condition such as
HAVING COUNT(*) >= minsup

Motivation
 Only a small portion of cube cells may be “above the water” in a sparse cube
 Only calculate “interesting” cells: data above a certain threshold
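The iceberg condition can be illustrated with a deliberately naive sketch: count every cell of every cuboid, then keep only the cells that reach the threshold. The function name `iceberg_cube` and the choice of minsup = 3 are assumptions for this example, and the rows are the six sale records used earlier in these notes. A practical algorithm would prune instead of counting cells that end up below the threshold.

```python
from collections import Counter
from itertools import combinations


def iceberg_cube(rows, dimensions, minsup):
    """Return {group-by dims: {cell: count}} for cells with count >= minsup."""
    result = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            counts = Counter(tuple(row[d] for d in dims) for row in rows)
            kept = {cell: n for cell, n in counts.items() if n >= minsup}
            if kept:
                result[dims] = kept
    return result


sales = [
    {"prodId": "p1", "storeId": "c1", "date": 1},
    {"prodId": "p2", "storeId": "c1", "date": 1},
    {"prodId": "p1", "storeId": "c3", "date": 1},
    {"prodId": "p2", "storeId": "c2", "date": 1},
    {"prodId": "p1", "storeId": "c1", "date": 2},
    {"prodId": "p1", "storeId": "c2", "date": 2},
]

cells = iceberg_cube(sales, ("prodId", "storeId", "date"), minsup=3)
for dims, kept in cells.items():
    print(dims, kept)
```

With this threshold only four cells survive: the apex (count 6), prodId = p1 (4), storeId = c1 (3), and date = 1 (4). No cell of any two- or three-dimensional cuboid reaches the threshold.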