PrefixCube: Prefix-sharing Condensed Data Cube
Jianlin Feng
Qiong Fang
Hulin Ding
Huazhong Univ. of Sci. & Tech.
fengjl@mail.hust.edu.cn
Nov 12, 2004
Outline
 Introduction
 Related Work
 ODM: Ordered Datacube Model
 BST-Condensed Cube
 Prefix-sharing Condensed Cube
 Comparisons
 Conclusions
DOLAP 2004
2
Jianlin Feng
Introduction

Data Cube (ICDE’96)
– N-dimensional cube(A1, A2, …, AN)
– 2^N cuboids, i.e. GROUP-BYs

The Huge Size Problem
– When the base relation R is sparse, the size of a cuboid can be
close to the size of R.
– Even the I/O cost of just storing the cube
result tuples becomes dominant.
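A minimal Python sketch (not from the talk) of why an N-dimensional cube has 2^N cuboids: every subset of the dimensions is one GROUP-BY. The function name `cuboids` is hypothetical.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every cuboid (GROUP-BY) of a cube as a tuple of dimension names."""
    result = []
    for k in range(len(dimensions) + 1):
        # every k-subset of the dimensions is one GROUP-BY
        result.extend(combinations(dimensions, k))
    return result

example = cuboids(["A", "B", "C"])
print(len(example))   # number of cuboids for N = 3
print(example)        # () is the grand-total cuboid, ('A', 'B', 'C') the base cuboid
```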
Related Work
Condensed Cube (ICDE’02)
 Dwarf (SIGMOD’02)
 Quotient Cube (VLDB’02)
 QC-Tree (SIGMOD’03)
 Basic idea: remove the redundancies
that exist among cube tuples.

– prefix redundancy
– suffix redundancy
Prefix redundancy

Given an example cube(A, B, C)
– Each value of dimension A occurs in 4
cuboids: cuboid(A), (AB), (AC) and (ABC)
– Possibly many times in each cuboid
except cuboid(A)

Inter-cuboid and Intra-cuboid prefix
redundancy
Suffix Redundancy


Occurs when cube tuples belonging to
different cuboids are actually aggregated
from the same group of base relation tuples.
An extreme case
– Let the source relation R have only a single
tuple r(a1, a2, …, an, m);
– its 2^n cube tuples can be condensed into one physical
tuple (a1, a2, …, an, V), where V = aggr(r),
– together with some information indicating that it
is a representative tuple.
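The extreme case can be sketched as follows (an illustration, not the authors' code). With a single base tuple, each dimension is either kept or replaced by ALL ("*"), and all 2^n resulting cube tuples carry the same aggregate V. The function name and the sample values are hypothetical.

```python
from itertools import product

ALL = "*"

def cube_tuples_of_single_tuple(dims, v):
    """All cube tuples of a relation that holds exactly one base tuple.

    dims: dimension values (a1, ..., an); v: the aggregate V = aggr(r).
    """
    choices = [(value, ALL) for value in dims]   # keep the value or roll up to ALL
    return [cells + (v,) for cells in product(*choices)]

cube = cube_tuples_of_single_tuple(("a1", "a2", "a3"), 42)
print(len(cube))                 # 2 ** 3 cube tuples
print({t[-1] for t in cube})     # they all share the one aggregate value
```

Since every one of these tuples can be derived from (a1, a2, a3, V), storing that single tuple plus a marker is enough.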
Thinking…

Condensed cube
– It condenses the cube tuples that are aggregated
from one single base tuple into one physical
tuple, in order to reduce the cube’s size.

Dwarf
– Besides suffix coalescing, i.e. multi-base-tuple condensing, it also realizes full prefix-sharing,
and so reduces cube size very effectively.
Motivation


How can we further reduce the condensed cube’s
size while taking into account the characteristics
of the queries we intend to answer, namely range queries?
Augment BST-condensing with the removal
of intra-cuboid prefix redundancy!
Ordered Datacube Model
 Value ALL (or *) is encoded as 0.
 A dimension D has cardinality C
– each dimension value is mapped one-to-one to
an integer between 1 and C inclusive.
 N dimensions form an N-dimensional space.
 The origin O(0, 0, …, 0) represents the
grand total.
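A small sketch of this encoding, under assumptions: the slide only requires a one-to-one mapping onto 1..C, so the sorted order used here, the helper names and the city values are all hypothetical.

```python
ALL = 0  # code reserved for the value ALL (*)

def build_encoder(values):
    """Map each distinct dimension value to an integer in 1..C (C = cardinality).

    Sorted order is an arbitrary choice; any one-to-one mapping would do.
    """
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}

def encode(cells, encoders):
    """Encode one cube tuple's dimension cells; '*' becomes 0."""
    return tuple(ALL if c == "*" else enc[c] for c, enc in zip(cells, encoders))

city = build_encoder(["Wuhan", "Beijing", "Shanghai", "Wuhan"])
year = build_encoder([2003, 2004])
print(encode(("Wuhan", "*"), [city, year]))   # a point in the cuboid on the first dimension
print(encode(("*", "*"), [city, year]))       # the origin, i.e. the grand total
```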

Ordered Datacube Model

Under ODM, a range query against a
data cube can actually be reduced to
a sub-query against only one
particular cuboid in the cube or a
union of such sub-queries.
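One way to picture this reduction, as a sketch under assumptions: a range query is taken to be an interval [lo, hi] of ODM codes per dimension, where code 0 (ALL) means the dimension is not grouped. An interval that covers both 0 and real values is split in two, which yields the union of sub-queries. The function name and query format are hypothetical, not the talk's notation.

```python
from itertools import product

def split_range_query(ranges):
    """Split a range query into sub-queries that each hit exactly one cuboid.

    ranges: dict mapping dimension name -> (lo, hi) over ODM codes, 0 = ALL.
    Returns a list of (cuboid, sub_ranges) pairs.
    """
    per_dim = []
    for dim, (lo, hi) in ranges.items():
        parts = []
        if lo == 0:
            parts.append((dim, None))               # the ALL point: dimension dropped
        if hi >= 1:
            parts.append((dim, (max(lo, 1), hi)))   # real values: dimension grouped
        per_dim.append(parts)
    subqueries = []
    for combo in product(*per_dim):
        cuboid = tuple(d for d, r in combo if r is not None)
        sub_ranges = {d: r for d, r in combo if r is not None}
        subqueries.append((cuboid, sub_ranges))
    return subqueries

print(split_range_query({"A": (1, 5), "B": (0, 0)}))   # one sub-query on cuboid (A)
print(split_range_query({"A": (0, 3), "B": (2, 4)}))   # union of sub-queries on (B) and (AB)
```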
BST-Condensed Cube

Base Single Tuple (BST)

      A   B   C   M
t1    8   1   1   100
t2    1   8   1   50
t3    1   2   3   60

– t1 is a BST on SD (single dimensions) {A} and {B}
– t2 is a BST on SD {B}

A unique minimal BST-condensed cube, MinCube,
is obtained when each BST is fully exploited
with all of its SDs.
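A sketch of the BST test on the three base tuples above. It assumes the usual definition: a tuple is a BST on SD if it is the only base tuple with its values on the dimensions in SD. Only single-dimension SDs are checked, and the helper names are hypothetical.

```python
DIMS = ("A", "B", "C")
R = {
    "t1": (8, 1, 1, 100),
    "t2": (1, 8, 1, 50),
    "t3": (1, 2, 3, 60),
}

def is_bst(name, sd):
    """True if tuple `name` is the only base tuple with its values on dimensions `sd`."""
    idx = [DIMS.index(d) for d in sd]
    key = tuple(R[name][i] for i in idx)
    matches = [n for n, t in R.items() if tuple(t[i] for i in idx) == key]
    return matches == [name]

def single_dim_sds(name):
    """Single-dimension SDs on which `name` is a BST."""
    return [d for d in DIMS if is_bst(name, (d,))]

for name in R:
    print(name, single_dim_sds(name))
```

Under this definition t3 also qualifies (on {B} and {C}), although the slide lists only t1 and t2.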
BU-BST Condensed Cube



BottomUpBST algorithm (ICDE’02)
– Each BST corresponds to only one SD.
– It is easier to compute, and normal cube tuples are easier to restore
from the condensed cube, than with MinCube.
Note: BST condensing is a special kind of
prefix-sharing!

A group of cube tuples that share a prefix:

A   B   C   M
8   *   *   10
8   1   *   10
8   *   1   10
8   1   1   10

is represented by one BST (ct7):

A   B   C   M    SD
8   1   1   10   {A}
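Restoring the normal cube tuples from a BST can be sketched as below. This is an illustration under assumptions: the dimensions in SD keep their values, and every other dimension is either kept or rolled up to "*". The function name `expand_bst` is hypothetical.

```python
from itertools import product

DIMS = ("A", "B", "C")

def expand_bst(values, measure, sd):
    """Restore the cube tuples that one BST stands for.

    values: the BST's dimension values; sd: the set of its single dimensions.
    """
    choices = [(v,) if d in sd else ("*", v) for d, v in zip(DIMS, values)]
    return [cells + (measure,) for cells in product(*choices)]

for t in expand_bst((8, 1, 1), 10, {"A"}):
    print(t)
```

For the BST (8, 1, 1, 10) with SD {A} this yields the four cube tuples with prefix A = 8.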
A BU-BST Condensed Cube Example
Note:
– Intra-cuboid prefix redundancy: ct3 and ct4
– Inter-cuboid prefix redundancy: ct2, ct3 and ct5

Base relation:

      A   B   C   M
t1    8   1   1   100
t2    1   8   1   50
t3    1   2   3   60

BU-BST condensed cube:

       A   B   C   M     SID/CID
ct1    *   *   *   210   ALL
ct2    1   *   *   110   A
ct3    1   2   3   60    AB
ct4    1   8   1   50    AB
ct5    1   *   1   50    AC
ct6    1   *   3   60    AC
ct7    8   1   1   100   A
ct8    *   1   1   100   B
ct9    *   2   3   60    B
ct10   *   8   1   50    B
ct11   *   *   1   150   C
ct12   *   *   3   60    C
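For comparison with the 12 condensed tuples ct1 to ct12, the uncondensed cube of the same three base tuples can be computed with a short sketch. It assumes SUM as the aggregate, which matches the M values shown (e.g. 210 = 100 + 50 + 60). The function name is hypothetical.

```python
from itertools import combinations

DIMS = ("A", "B", "C")
BASE = [(8, 1, 1, 100), (1, 8, 1, 50), (1, 2, 3, 60)]

def full_cube(base):
    """Compute every cube tuple of `base` with SUM; returns {cells: aggregate}."""
    cube = {}
    n = len(DIMS)
    for k in range(n + 1):
        for group_by in combinations(range(n), k):
            for *dims, m in base:
                # dimensions outside the GROUP-BY roll up to '*'
                key = tuple(dims[i] if i in group_by else "*" for i in range(n))
                cube[key] = cube.get(key, 0) + m
    return cube

cube = full_cube(BASE)
print(len(cube))               # number of tuples in the full cube
print(cube[("*", "*", "*")])   # grand total, cf. ct1
print(cube[(1, "*", "*")])     # cf. ct2
```

The full cube has 20 tuples, so BU-BST condensing already removes 8 of them on this tiny example.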
Prefix-sharing Condensed Cube - PrefixCube
PrefixCube = BST condensing + intra-cuboid prefix-sharing
(both are forms of prefix-sharing)
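Intra-cuboid prefix-sharing can be illustrated with a generic trie over the tuples of one cuboid. This is a sketch only: nested dicts stand in for whatever physical layout PrefixCube actually uses, and the helper names are hypothetical. The sample tuples are the three cuboid(AB) tuples of a base relation with rows (8, 1, 1, 100), (1, 8, 1, 50), (1, 2, 3, 60).

```python
def build_trie(tuples):
    """Nested dicts, one level per dimension; the last level maps a value to the measure."""
    root = {}
    for *dims, m in tuples:
        node = root
        for v in dims[:-1]:
            node = node.setdefault(v, {})   # a shared prefix value is stored once
        node[dims[-1]] = m
    return root

def count_cells(node):
    """Number of dimension values physically stored in the trie."""
    return sum(1 + (count_cells(child) if isinstance(child, dict) else 0)
               for child in node.values())

cuboid_ab = [(1, 2, 60), (1, 8, 50), (8, 1, 100)]
trie = build_trie(cuboid_ab)
print(trie)
print(count_cells(trie), "values stored vs.", 2 * len(cuboid_ab), "in a flat table")
```

The value A = 1 is stored once instead of twice, which is the intra-cuboid saving in miniature.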
A PrefixCube Example
[Figure: the PrefixCube built for the example cube, with N-Roots and V-Roots entries and trees labelled CID = ALL, CID = A, CID = AC and SID = A, SID = AB, SID = B.]
Corresponding Dwarf
[Figure: the corresponding Dwarf structure, with node1 to node4 laid out across the A, B and C dimension levels.]
PrefixCube vs. Dwarf
                              PrefixCube        Dwarf
Prefix-sharing                Intra-cuboid      Inter- and intra-cuboid
Suffix coalescing             BST condensing    Multi-tuple condensing
Compression ratio             Lower             Higher
Saving extra value ALL?       No                Yes
Tuples clustered by cuboid?   Yes               No

PrefixCube does not blindly aim for the best compression ratio;
it is intended to strike a good compromise among cube size reduction
ratio, restoring and updating costs, and query characteristics!
Effectiveness of Size Reduction

Datasets
– synthetic datasets with uniform distribution
– # of tuples: 1,000,000
[Figure: size ratio vs. number of dimensions (2 to 9) for BU-BST and PrefixCube; (a) Cardinality = 100, (b) Cardinality = 1000.]
Effectiveness of Size Reduction

PrefixBUC
– Full Cube (computed by BUC)
– Prefix-sharing
[Figure: size ratio vs. number of dimensions (2 to 9) for C = 100 and C = 1000.]
Impact of Data Density

Datasets
– Uniform distribution
– # of dimensions: 6
– Cardinality of dimensions: 100
– # of tuples: range from 1,000 to 1,000,000
[Figure: size ratio vs. number of tuples (1.E+03 to 1.E+06) for BU-BST, PrefixCube and PrefixBUC.]
Impact of Data Skewness

Datasets
– Zipf distribution
– # of tuples: 1,000,000
– Cardinality of dimensions: from 1,000 down to 500 in steps of 100
– Zipf factor: from 0 to 0.8 in steps of 0.2
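A dataset of this shape could be generated roughly as follows. This is a sketch, not the generator used in the experiments. It assumes the probability of the k-th value is proportional to 1 / k^z (so z = 0 is uniform) and reads the cardinality list as six dimensions with 1,000, 900, …, 500 values. All names are hypothetical, and the row count is reduced for a quick run.

```python
import random

def zipf_column(n_rows, cardinality, z, rng):
    """Draw n_rows codes in 1..cardinality with P(k) proportional to 1 / k**z."""
    weights = [1.0 / (k ** z) for k in range(1, cardinality + 1)]
    return rng.choices(range(1, cardinality + 1), weights=weights, k=n_rows)

def synthetic_relation(n_rows, cardinalities, z, seed=0):
    """Build a relation (list of dimension tuples) with one Zipf column per dimension."""
    rng = random.Random(seed)
    columns = [zipf_column(n_rows, c, z, rng) for c in cardinalities]
    return list(zip(*columns))

cardinalities = list(range(1000, 499, -100))   # 1000, 900, ..., 500
relation = synthetic_relation(10_000, cardinalities, z=0.8)
print(len(relation), len(relation[0]))
```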
[Figure: size ratio vs. Zipf factor (0 to 0.8) for BU-BST, PrefixCube and PrefixBUC.]
Real-world Dataset

Datasets
– Weather Datasets
– # of tuples: 1,015,367
[Figure: two plots against number of dimensions (2 to 9): time (sec.) for BUC, BU-BST and PrefixCube; size ratio for BU-BST, PrefixCube and PrefixBUC.]
Conclusion

A new cube structure, PrefixCube, was
proposed by augmenting BU-BST
condensing with intra-cuboid prefix-sharing.
– It can greatly reduce the data cube’s size
compared with the BU-BST condensed cube.
– It can also reduce the impact of data skew
on BU-BST condensing.
– It achieves a fairly stable size reduction
on both dense and sparse datasets.
The End
Thank you!
Any questions?