Efficient Incremental Maintenance of Data Cubes
2006. 9. 15.
Ki Yong Lee
Software Laboratories
Samsung Electronics Co., Ltd.
Myoung Ho Kim
Division of Computer Science
Korea Advanced Institute of Science and Technology
Outline

Introduction
Related work
– Incremental maintenance of aggregate views
– Incremental maintenance of data cubes
Our approach
– Key idea
– Problem formulation
– Heuristic algorithm
Performance evaluation
Conclusion
K. Y. Lee and M. H. Kim
32nd International Conference on Very Large Data Bases
Data Cube

A generalized group-by operator [GBP96]
– Computes group-bys for all possible combinations of a given set of dimension attributes
– For n dimension attributes, the cube consists of 2^n group-bys

SELECT a, b, SUM(m)
FROM F
CUBE BY a, b

With the dimension attributes a and b, this is equivalent to the following 2^2 = 4 group-bys:

SELECT a, b, SUM(m)
FROM F
GROUP BY a, b

SELECT a, '*', SUM(m)
FROM F
GROUP BY a

SELECT '*', b, SUM(m)
FROM F
GROUP BY b

SELECT '*', '*', SUM(m)
FROM F
(GROUP BY ∅)
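As a rough illustration of what CUBE BY computes, the following Python sketch evaluates one group-by per subset of the dimension attributes. The function name data_cube, the dictionary-based row format and the sample rows in F are assumptions made for this example only; they do not come from the slides.

```python
from itertools import combinations
from collections import defaultdict

def data_cube(rows, dims, measure):
    """Return {grouping attributes: {group key: SUM(measure)}} for all 2^n subsets of dims."""
    cube = {}
    for k in range(len(dims), -1, -1):
        for attrs in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                # Attributes that are not grouped on are shown as '*', as on the slide.
                key = tuple(row[d] if d in attrs else "*" for d in dims)
                totals[key] += row[measure]
            cube[attrs] = dict(totals)
    return cube

# Hypothetical fact table F with dimension attributes a, b and measure m.
F = [
    {"a": 1, "b": 1, "m": 3},
    {"a": 1, "b": 2, "m": 4},
    {"a": 2, "b": 1, "m": 5},
]

cube = data_cube(F, ("a", "b"), "m")
for attrs, groups in cube.items():
    print(attrs, groups)
```

Each group-by rescans F here for simplicity; a real system would compute coarser group-bys from finer ones, which is what the cube lattice captures.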
Cube Lattice

We represent a data cube as a lattice diagram [HRU96]
– Each node represents a group-by, which is called a cuboid
– Each edge (qi, qj) represents that qj can be computed from qi

[Figure: cube lattice for the dimension attributes a, b, c, d. The nodes are the 16 cuboids ∅, a, b, c, d, ab, ac, ad, bc, bd, cd, abc, abd, acd, bcd and abcd; the edges denote aggregation.]
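The lattice can be sketched in a few lines of Python. This is only an illustration: the function name cube_lattice is hypothetical, and the sketch lists just the direct edges, where qj is qi with one attribute removed.

```python
from itertools import combinations

def cube_lattice(dims):
    """Nodes are cuboids (attribute sets); an edge (qi, qj) means qj is qi minus one attribute."""
    nodes = [frozenset(c) for k in range(len(dims) + 1) for c in combinations(dims, k)]
    edges = [(qi, qi - {d}) for qi in nodes for d in qi]
    return nodes, edges

nodes, edges = cube_lattice("abcd")
print(len(nodes), "cuboids,", len(edges), "direct edges")
```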
Maintenance of Data Cubes


A data cube is typically stored as a materialized view
How can we update a data cube efficiently when the source relations change?

[Figure: the materialized cube over the lattice of a, b, c, d, defined on F, has to be brought up to date with the same cube defined on the changed relation F']

SELECT a, b, c, d, SUM(m)
FROM F
CUBE BY a, b, c, d

SELECT a, b, c, d, SUM(m)
FROM F'
CUBE BY a, b, c, d
Related Work (1/2)
Incremental maintenance of an aggregate view

A: SELECT a, b, c, SUM(m)
FROM F
GROUP BY a, b, c

ΔA: SELECT a, b, c, SUM(m)
FROM ΔF
GROUP BY a, b, c

F
a  b  c  m
1  1  1  3
1  2  2  3
2  1  3  5
2  1  3  4

A
a  b  c  SUM(m)
1  1  1  3
1  2  2  3
2  1  3  9

ΔF
a  b  c  m
1  2  2  4
2  3  1  2
2  3  1  6

ΔA
a  b  c  SUM(m)
1  2  2  4
2  3  1  8

A' (A refreshed by ΔA)
a  b  c  SUM(m)
1  1  1  3
1  2  2  3+4
2  1  3  9
2  3  1  8
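The example can be replayed in Python as a minimal sketch. The helper names group_sum and refresh are hypothetical, the tuples are the values as they can be read from the slide, and the sketch only handles insertions with a SUM aggregate (deletions would additionally need negated measures and a count).

```python
from collections import defaultdict

def group_sum(rows):
    """SELECT a, b, c, SUM(m) ... GROUP BY a, b, c over (a, b, c, m) tuples."""
    out = defaultdict(int)
    for a, b, c, m in rows:
        out[(a, b, c)] += m
    return dict(out)

def refresh(view, delta_view):
    """Add each delta tuple to its matching view tuple, or insert it if there is none."""
    refreshed = dict(view)
    for key, dm in delta_view.items():
        refreshed[key] = refreshed.get(key, 0) + dm
    return refreshed

F = [(1, 1, 1, 3), (1, 2, 2, 3), (2, 1, 3, 5), (2, 1, 3, 4)]
delta_F = [(1, 2, 2, 4), (2, 3, 1, 2), (2, 3, 1, 6)]

A = group_sum(F)              # the aggregate view
delta_A = group_sum(delta_F)  # the delta, computed from the changes only
A_new = refresh(A, delta_A)   # the refreshed view A'
print(A_new)
```

The point of the technique is that A' is obtained from A and ΔF alone, without rescanning F.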
Related Work (2/2)

Incremental maintenance of a data cube
– Propagate stage: computes the delta cube
– Refresh stage: refreshes the data cube by the delta cube

[Figure: the original cube (cuboids abc, ab, ac, bc, a, b, c and ∅, defined on F) plus the delta cube (delta cuboids Δabc, Δab, Δac, Δbc, Δa, Δb, Δc and Δ, computed from ΔF) gives the updated cube (abc', ab', ac', bc', a', b', c' and ∅').]
Motivation

To incrementally maintain a data cube with 2^n cuboids, existing methods compute 2^n delta cuboids
– As n increases, the maintenance cost increases significantly

[Figure: for n = 4, the original cube with its 16 cuboids (∅ through abcd) next to the delta cube with its 16 delta cuboids (∆ through ∆abcd).]
Motivation (cont’d)

Each cuboid is refreshed separately in existing methods
[Figure: each of the 16 delta cuboids (∆ through ∆abcd) refreshes only its own cuboid (∅ through abcd), so 2^n delta cuboids are used.]
Key Idea

Refresh more than one cuboid by a delta cuboid
[Figure: instead of one delta cuboid per cuboid, six delta cuboids (∆bc, ∆cd, ∆bcd, ∆acd, ∆abd and ∆abcd) refresh all 16 cuboids. For example, ∆abcd refreshes abcd, abc, ab, a and ∅.]
Key Idea (cont’d)

Benefit
– The number of delta cuboids that need to be computed is reduced

Existing methods: 2^n = 16 delta cuboids need to be computed
∆,
∆a, ∆b, ∆c, ∆d,
∆ab, ∆ac, ∆ad, ∆bc, ∆bd, ∆cd,
∆abc, ∆abd, ∆acd, ∆bcd,
∆abcd

Proposed method: only 6 delta cuboids need to be computed
∆bc, ∆cd,
∆abd, ∆acd, ∆bcd,
∆abcd
Key Idea (cont’d)

Refreshing more than one cuboid by a delta cuboid
[Figure: tables ∆ab, ab, ∆abc and abc, each with columns a, b, c and SUM(m). The ∆abc tuple (1, 1, 3) with SUM(m) = 4 is added to the matching tuple of abc (4 + 4 = 8) and also to the tuple (1, 1, *) of ab (9 + 4 = 13), so ab is refreshed by ∆abc instead of ∆ab.]
Key Idea (cont’d)

However, this method requires more access to ab
[Figure: the same tables ∆ab, ab, ∆abc and abc. Refreshing ab by ∆ab takes |Δab| accesses to ab, whereas refreshing ab tuple by tuple from ∆abc takes |Δabc| accesses to ab, on top of the |Δabc| accesses to abc.]
Key Idea (cont’d)

But, if Δabc is sorted by a, b and c, ab can be refreshed
by Δabc without more access to ab
[Figure: ∆abc sorted by a, b and c, with a running sum m_ab kept for the current (a, b) group:

a  b  c  SUM(m)  m_ab
1  1  1  6       6
1  1  2  9       15
1  1  3  4       19
1  2  1  4
1  2  2  2
1  2  3  8

Each tuple of ∆abc refreshes its matching tuple of abc (for example, 4 + 4 = 8 for (1, 1, 3)). When the group (1, 1) ends, ab is accessed once with the accumulated sum: 9 + 19 = 28.]
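The one-pass refresh can be sketched as follows. This is an illustrative reading of the idea, not the authors' implementation: the function name refresh_with_prefix, the dictionary representation of the cuboids and the sample values (modeled on the slide's example) are assumptions.

```python
from itertools import groupby

def refresh_with_prefix(delta_abc, abc, ab):
    """Refresh abc and ab in place from delta_abc, a list of ((a, b, c), m) sorted by (a, b, c).

    Returns the number of accesses made to ab: one per (a, b) group, not one per delta tuple.
    """
    accesses_to_ab = 0
    for ab_key, group in groupby(delta_abc, key=lambda t: t[0][:2]):
        m_ab = 0  # running sum for the current (a, b) group
        for key, m in group:
            abc[key] = abc.get(key, 0) + m
            m_ab += m
        ab[ab_key] = ab.get(ab_key, 0) + m_ab
        accesses_to_ab += 1
    return accesses_to_ab

delta_abc = [((1, 1, 1), 6), ((1, 1, 2), 9), ((1, 1, 3), 4),
             ((1, 2, 1), 4), ((1, 2, 2), 2), ((1, 2, 3), 8)]
abc = {(1, 1, 1): 2, (1, 1, 2): 3, (1, 1, 3): 4}
ab = {(1, 1): 9}

n_accesses = refresh_with_prefix(delta_abc, abc, ab)
print(ab, n_accesses)
```

Because the input is sorted, all tuples of one (a, b) group are adjacent, so the partial sum can be held in a single variable and written to ab only when the group ends.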
Key Idea (cont’d)

A delta cuboid can be easily sorted during its computation with little or no additional cost
– In most existing commercial relational database systems, aggregation algorithms are based on sorting [G93]
– If a group-by is computed by sorting algorithms, sorted results on the grouping attributes can be easily obtained

We assume that a delta cuboid is computed by sorting-based algorithms
– Thus, the above method can be applied to any delta cuboid with no additional sorting cost
Generalization of the Idea

Δ[d1d2…dk]
– A delta cuboid sorted in the order of attributes d1, d2, …, dk

The following set of cuboids can be refreshed by Δ[d1d2…dk]
– {d1d2…dk, d1d2…dk-1, …, d1, ∅}

Δ[d1d2…dk] → {q1, q2, …, qi}
– Cuboids q1, q2, …, qi are refreshed by Δ[d1d2…dk]

Example
– Δ[abcd] → {abcd, abc, ab, a, ∅}
  • Cuboids abcd, abc, ab, a, ∅ are refreshed by Δ[abcd]
– Δ[acdb] → {acdb, acd, ac, a, ∅}
  • Cuboids acdb, acd, ac, a, ∅ are refreshed by Δ[acdb]
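In other words, the cuboids refreshable by a sorted delta cuboid are exactly the prefixes of its sort order. A small Python sketch (the function name refreshable_cuboids is hypothetical, and the empty string stands for the empty cuboid ∅):

```python
def refreshable_cuboids(sort_order):
    """Cuboids that can be refreshed by the delta cuboid sorted on sort_order: all its prefixes."""
    return [sort_order[:k] for k in range(len(sort_order), -1, -1)]

print(refreshable_cuboids("abcd"))  # ['abcd', 'abc', 'ab', 'a', '']
print(refreshable_cuboids("acdb"))  # ['acdb', 'acd', 'ac', 'a', '']
```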
Our Approach

We propose an incremental maintenance method that can maintain a data cube with 2^n cuboids using only a subset of the 2^n delta cuboids

2^3 = 8 delta cuboids
∆[abc] → {abc}
∆[ab] → {ab}
∆[ca] → {ca}
∆[bc] → {bc}
∆[a] → {a}
∆[b] → {b}
∆[c] → {c}
∆ → {∅}

4 delta cuboids
∆[acb] → {acb, ac}
∆[ba] → {ba, b}
∆[cb] → {cb, c}
∆[a] → {a, ∅}

3 delta cuboids
∆[abc] → {abc, ab, a, ∅}
∆[ca] → {ca, c}
∆[bc] → {bc, b}
Cost of Computing Delta Cuboids


Different sets of delta cuboids incur different computation cost
We represent the cost of computing delta cuboids by the cost of a delta cuboid computation plan

[Figure: three sets of delta cuboids, each shown with its computation plan as a subtree of the delta lattice with cost-labeled edges:
– {∆[abc], ∆[ab], ∆[ca], ∆[bc], ∆[a], ∆[b], ∆[c], ∆}
– {∆[acb], ∆[ba], ∆[cb], ∆[a]}
– {∆[abc], ∆[ca], ∆[bc]}]
Problem Formulation

Delta cube
– ΔQ = {Δq1, Δq2, …, Δqm}, where Δqi is a delta cuboid

Refresh chain
– A sequence of elements <Δq1, Δq2, …, Δqi> in ΔQ such that q1 ⊃ q2 ⊃ … ⊃ qi
– <Δq1, Δq2, …, Δqi> implies Δ[q1] → {q1, q2, …, qi}
– Example
  • <Δabc, Δab, Δa> implies Δ[abc] → {abc, ab, a}
  • <Δcba, Δcb, Δc> implies Δ[cba] → {cba, cb, c}
Problem Formulation (cont’d)

Refresh partition
– A partition of the elements of ΔQ into disjoint refresh chains
– Example
  • {<Δacb, Δac>, <Δab, Δb>, <Δbc, Δc>, <Δa, Δ>} implies
    Δ[acb] → {acb, ac}
    Δ[ba] → {ba, b}
    Δ[cb] → {cb, c}
    Δ[a] → {a, ∅}
  • {<Δabc, Δab, Δa, Δ>, <Δac, Δc>, <Δbc, Δb>} implies
    Δ[abc] → {abc, ab, a, ∅}
    Δ[ca] → {ca, c}
    Δ[bc] → {bc, b}
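The two definitions can be checked mechanically. The sketch below is an illustration under assumed conventions: cuboids are written as attribute strings, the empty string stands for ∅, and the function names are hypothetical.

```python
def is_refresh_chain(chain):
    """Each cuboid in the chain must be a proper subset of the one before it."""
    sets = [frozenset(q) for q in chain]
    return all(later < earlier for earlier, later in zip(sets, sets[1:]))

def is_refresh_partition(chains, delta_cube):
    """The chains must be valid, pairwise disjoint, and cover every delta cuboid exactly once."""
    members = [frozenset(q) for chain in chains for q in chain]
    disjoint = len(members) == len(set(members))
    covering = set(members) == {frozenset(q) for q in delta_cube}
    return disjoint and covering and all(is_refresh_chain(c) for c in chains)

delta_cube = ["abc", "ab", "ac", "bc", "a", "b", "c", ""]
p1 = [["acb", "ac"], ["ab", "b"], ["bc", "c"], ["a", ""]]
p2 = [["abc", "ab", "a", ""], ["ac", "c"], ["bc", "b"]]
not_a_partition = [["abc", "ab", "c"], ["ac", "a", ""], ["bc", "b"]]

print(is_refresh_partition(p1, delta_cube))
print(is_refresh_partition(p2, delta_cube))
print(is_refresh_partition(not_a_partition, delta_cube))
```

The third candidate fails because c is not a subset of ab, so <Δabc, Δab, Δc> is not a refresh chain.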
Problem Formulation (cont’d)

Delta cuboid computation plan
– A subtree of the delta lattice including at least all of the first elements of refresh chains in a given refresh partition
– Example
  • Refresh partition: {<Δabc, Δab, Δa, Δ>, <Δca, Δc>, <Δbc, Δb>}
  • Delta cuboid computation plan: [Figure: a subtree of the delta lattice over Δ, Δa, Δb, Δc, Δab, Δca, Δbc and Δabc that contains Δabc, Δca and Δbc; its edges carry the costs 15 and 14.]
Problem Statement

Given a delta cube and its delta lattice, find a refresh partition that minimizes the cost of a delta cuboid computation plan

Given
Delta cube: {Δabc, Δab, Δac, Δbc, Δa, Δb, Δc, Δ}

Find
Refresh partition: {<Δabc, Δab, Δa, Δ>, <Δca, Δc>, <Δbc, Δb>}
Delta cuboid computation plan: [Figure: a subtree of the delta lattice over ∆, ∆a, ∆b, ∆c, ∆ab, ∆ca, ∆bc and ∆abc that contains ∆abc, ∆ca and ∆bc; its edges carry the costs 15 and 14.]
NP-Hardness of the Problem

For a given refresh partition, finding the minimum cost delta cuboid computation plan is NP-complete
– (proved in the paper)

Our problem is NP-hard
– Our problem is at least as hard as finding the minimum cost delta cuboid computation plan
– Moreover, there can be many refresh partitions for a given delta cube

Hence, we resort to heuristic approaches
Idea behind Our Heuristic

As the number of delta cuboids to be computed increases, the cost of a delta cuboid computation plan increases

Hence, we minimize the number of refresh chains in a refresh partition as far as possible

[Figure: {Δab}, {Δa}, {Δb}, {Δ} (4 chains) → {Δab, Δa}, {Δb}, {Δ} (3 chains) → {Δab, Δa}, {Δb, Δ} (2 chains)]

The minimum number of refresh chains in a refresh partition for a delta cube with 2^n delta cuboids = C(n, ⌊n/2⌋), that is, n choose ⌊n/2⌋ (proved in the paper)
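That this minimum can actually be reached may be seen with a symmetric chain decomposition of the subset lattice. The sketch below uses that classical textbook construction as a stand-in: it is not the authors' heuristic (it ignores costs entirely), and the function name symmetric_chains is hypothetical.

```python
from math import comb

def symmetric_chains(attrs):
    """Partition all subsets of attrs into chains of nested sets, largest set first."""
    chains = [[frozenset()]]  # each chain is kept in ascending order while building
    for x in attrs:
        extended = []
        for chain in chains:
            # Keep the chain and extend it at the top with x ...
            extended.append(chain + [chain[-1] | {x}])
            # ... and add x to every set except the top one to form a second chain.
            if len(chain) > 1:
                extended.append([q | {x} for q in chain[:-1]])
        chains = extended
    return [list(reversed(chain)) for chain in chains]

for n in range(1, 7):
    chains = symmetric_chains("abcdef"[:n])
    print(n, 2 ** n, len(chains), comb(n, n // 2))

example = symmetric_chains("abc")
```

Every chain consists of strictly nested attribute sets, so it is a refresh chain in the sense defined earlier; which of the many minimum-size partitions is cheapest is the part the heuristic has to decide.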
Heuristic Algorithm
① Start from the refresh partition with 2^n refresh chains
– Each refresh chain consists of only one delta cuboid
② Repeatedly merge refresh chains until there are exactly C(n, ⌊n/2⌋) refresh chains in the refresh partition
– Whenever refresh chains are merged, a new delta cuboid computation plan with less cost is produced

[Figure: for n = 3, the 2^n = 8 single-element chains {Δ}, {Δa}, {Δb}, {Δc}, {Δab}, {Δca}, {Δbc}, {Δabc} are merged step by step (for example into {Δb, Δ} and {Δa, Δab}) until the C(n, ⌊n/2⌋) = 3 chains {Δca, Δc}, {Δbc, Δb, Δ} and {Δabc, Δab, Δa} remain.]
Example of the Algorithm
[Figure: the algorithm applied to the delta lattice of a three-attribute cube, whose edges are labeled with costs.
(1) Input: the delta lattice with levels Lv(0): ∆; Lv(1): ∆a, ∆b, ∆c; Lv(2): ∆ab, ∆ca, ∆bc; Lv(3): ∆abc
(2) Step 1: {∆}, {∆a}, {∆b}, {∆c}, {∆ab}, {∆ca}, {∆bc}, {∆abc}
(3) Step 2: {∆b, ∆}, {∆a}, {∆c}, {∆ab}, {∆ca}, {∆bc}, {∆abc}
(4) Step 3: {∆ab, ∆a}, {∆ca, ∆c}, {∆bc, ∆b, ∆}, {∆abc}
(5) Step 4: {∆ca, ∆c}, {∆bc, ∆b, ∆}, {∆abc, ∆ab, ∆a}
(6) Output: {{∆ca, ∆c}, {∆bc, ∆b, ∆}, {∆abc, ∆ab, ∆a}}]
Analysis of the Algorithm

Lemma 1: Given a delta cube with 2^n delta cuboids, the proposed heuristic algorithm produces a refresh partition with exactly C(n, ⌊n/2⌋) refresh chains, where C(n, ⌊n/2⌋) is n choose ⌊n/2⌋

Thus, we need to compute only C(n, ⌊n/2⌋) delta cuboids to refresh a data cube with 2^n cuboids

n    2^n    C(n, ⌊n/2⌋)
2    4      2
3    8      3
…    …      …
8    256    70
9    512    126
10   1024   252
Analysis of the Algorithm (cont’d)

Lemma 2: Let T_C be a delta cuboid computation plan found by the proposed heuristic. Then the following is true.

Cost(T_C) < Cost(T_G/2) < Cost(T_G)

– G: a delta lattice with 2^n delta cuboids
– G/2: a subgraph of G such that Level(0), Level(1), …, Level(n/2) are removed from G
– T_G: the minimum spanning tree of G
– T_G/2: the minimum spanning tree of G/2

Thus, the cost of T_C is bounded by the cost of T_G/2
Performance Evaluation (1/3)

Data warehouse environment
– Oracle9i database
– Sun Blade 1000 with UltraSparc III CPU and 512MB RAM
– TPC-H benchmark schema and data

Data cubes used in the experiments
– Defined over lineitem table in the TPC-H schema

Data Cube   Dimension Attributes                                            Measure
Q1          l_orderkey, l_partkey, l_suppkey                                l_quantity
Q2          l_orderkey, l_partkey, l_suppkey, l_shipdate                    l_quantity
Q3          l_orderkey, l_partkey, l_suppkey, l_shipdate, l_receiptdate     l_quantity
Performance Evaluation (2/3)
By varying the size of changes

[Figure: maintenance time in seconds (y-axis) against the delta size in % (x-axis, 2 to 20) for (a) Q1, (b) Q2 and (c) Q3. Each panel compares Previous (Propagate), Previous (Refresh), Ours (Propagate) and Ours (Refresh).]
Performance Evaluation (3/3)

The number of tuples generated in the experiment

[Figure: number of tuples (y-axis) against the delta size in % (x-axis, 2 to 20) for (a) Q1, (b) Q2 and (c) Q3. Each panel compares Ours and Previous.]
Summary

We proposed an efficient incremental maintenance method for data cubes

The proposed method can refresh a data cube with 2^n cuboids using only C(n, ⌊n/2⌋) delta cuboids
– The cost of computing delta cuboids can be substantially reduced

We formulated the problem and developed a heuristic algorithm for this problem

We showed the efficiency of the proposed method through performance evaluation
References




[GBP96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. In Proceedings of the ICDE Conference, p. 152-159, 1996.
[HRU96] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing Data Cubes Efficiently. In Proceedings of the ACM SIGMOD Conference, p. 205-216, 1996.
[G93] G. Graefe. Query Evaluation Techniques for Large Databases. ACM Computing Surveys, Vol. 25, Issue 2, p. 73-169, 1993.
[FG82] L. R. Foulds and R. L. Graham. The Steiner Problem in Phylogeny is NP-Complete. Advances in Applied Mathematics, 3: 43-49, 1982.