Efficient Incremental Maintenance of Data Cubes

2006. 9. 15.

Ki Yong Lee
Software Laboratories, Samsung Electronics Co., Ltd.

Myoung Ho Kim
Division of Computer Science, Korea Advanced Institute of Science and Technology

K. Y. Lee and M. H. Kim, 32nd International Conference on Very Large Data Bases

Outline

Introduction
Related work
  – Incremental maintenance of aggregate views
  – Incremental maintenance of data cubes
Our approach
  – Key idea
  – Problem formulation
  – Heuristic algorithm
Performance evaluation
Conclusion

Data Cube

A generalized group-by operator [GBP96]
  – Computes group-bys for all possible combinations of a given set of attributes
  – With n dimension attributes, this yields 2^n group-bys

Example with dimension attributes a and b:

  SELECT a, b, SUM(m) FROM F CUBE BY a, b

is the union of 2^2 = 4 group-bys:

  SELECT a, b, SUM(m) FROM F GROUP BY a, b
  SELECT a, '*', SUM(m) FROM F GROUP BY a
  SELECT '*', b, SUM(m) FROM F GROUP BY b
  SELECT '*', '*', SUM(m) FROM F    (empty GROUP BY)

Cube Lattice

We represent a data cube as a lattice diagram [HRU96]
  – Each node represents a group-by, which is called a cuboid
  – Each edge (qi, qj) represents that qj can be computed from qi by aggregation

[Figure: cube lattice over dimension attributes a, b, c, d with levels {abcd}, {abc, abd, acd, bcd}, {ab, ac, ad, bc, bd, cd}, {a, b, c, d} and the empty group-by]

Maintenance of Data Cubes

A data cube is typically stored as a materialized view
How can we update a data cube efficiently when the source relations change?

[Figure: the cube lattice of SELECT a, b, c, d, SUM(m) FROM F CUBE BY a, b, c, d has to be brought up to date with SELECT a, b, c, d, SUM(m) FROM F' CUBE BY a, b, c, d once F has changed to F']
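To make the CUBE BY semantics from the Data Cube slide concrete, here is a minimal Python sketch that computes all 2^n group-bys by brute force. The function name cube_by and the three-row fact table F are assumptions made for illustration; they do not come from the talk, and the sketch is not how a database system would evaluate a cube.

```python
from collections import defaultdict
from itertools import combinations


def cube_by(rows, dims, measure):
    """Return {cuboid: {group_key: SUM(measure)}} for every subset of dims."""
    cube = {}
    for k in range(len(dims) + 1):
        for cuboid in combinations(dims, k):
            groups = defaultdict(int)
            for row in rows:
                # Attributes that are not grouped on are shown as '*'.
                key = tuple(row[d] if d in cuboid else "*" for d in dims)
                groups[key] += row[measure]
            cube[cuboid] = dict(groups)
    return cube


# Hypothetical fact table F(a, b, m); the values are made up.
F = [
    {"a": 1, "b": 1, "m": 3},
    {"a": 1, "b": 2, "m": 4},
    {"a": 2, "b": 1, "m": 5},
]

cube = cube_by(F, dims=("a", "b"), measure="m")
for cuboid, groups in cube.items():
    print(cuboid, groups)
```

With two dimension attributes the result holds 2^2 = 4 cuboids, one per group-by listed on the slide.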
Related Work (1/2)

Incremental maintenance of an aggregate view

  A:  SELECT a, b, c, SUM(m) FROM F GROUP BY a, b, c
  ΔA: SELECT a, b, c, SUM(m) FROM ΔF GROUP BY a, b, c

  F                  A
  a b c m            a b c SUM(m)
  1 1 1 3            1 1 1 3
  1 2 2 3            1 2 2 3
  2 1 3 5            2 1 3 9
  2 1 3 4

  ΔF                 ΔA
  a b c m            a b c SUM(m)
  1 2 2 4            1 2 2 4
  2 3 1 2            2 3 1 8
  2 3 1 6

  A' (A refreshed by ΔA)
  a b c SUM(m)
  1 1 1 3
  1 2 2 3+4 = 7
  2 1 3 9
  2 3 1 8

Related Work (2/2)

Incremental maintenance of a data cube
  – Propagate stage: computes the delta cube
  – Refresh stage: refreshes the data cube by the delta cube

[Figure: original cube over F (cuboids a, b, c, ab, ac, bc, abc) + delta cube over ΔF (delta cuboids Δa, Δb, Δc, Δab, Δac, Δbc, Δabc) = updated cube (a', b', c', ab', ac', bc', abc')]

Motivation

To incrementally maintain a data cube with 2^n cuboids, existing methods compute 2^n delta cuboids
  – The number of delta cuboids doubles with each added dimension attribute, so the maintenance cost increases significantly as n increases

[Figure: original cube over a, b, c, d next to its delta cube, with one delta cuboid (Δ, Δa, Δb, Δc, Δd, Δab, Δac, Δad, Δbc, Δbd, Δcd, Δabc, Δabd, Δacd, Δbcd, Δabcd) per cuboid]

Motivation (cont'd)

Each cuboid is refreshed separately in existing methods
  – 2^n delta cuboids are used

[Figure: each of the 16 delta cuboids Δ, Δa, …, Δabcd refreshes exactly one of the 16 cuboids]

Key Idea

Refresh more than one cuboid by a delta cuboid
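As a baseline for what follows, the propagate and refresh stages for a single aggregate view (Related Work 1/2) can be sketched in Python as below. The tuples are the F and ΔF values as far as they can be read from that slide; the helper names group_sum and refresh are assumptions, and the sketch handles insertions only.

```python
from collections import defaultdict


def group_sum(rows):
    """SELECT a, b, c, SUM(m) ... GROUP BY a, b, c over (a, b, c, m) tuples."""
    out = defaultdict(int)
    for a, b, c, m in rows:
        out[(a, b, c)] += m
    return dict(out)


def refresh(view, delta):
    """Return a copy of the materialized view with the delta view merged in."""
    updated = dict(view)
    for key, m in delta.items():
        updated[key] = updated.get(key, 0) + m
    return updated


F = [(1, 1, 1, 3), (1, 2, 2, 3), (2, 1, 3, 5), (2, 1, 3, 4)]
delta_F = [(1, 2, 2, 4), (2, 3, 1, 2), (2, 3, 1, 6)]

A = group_sum(F)                  # materialized view A
delta_A = group_sum(delta_F)      # propagate stage: compute the delta view
A_new = refresh(A, delta_A)       # refresh stage: apply it to A
print(A_new)
```

The refreshed view equals the view recomputed from F plus ΔF, but only the tuples of ΔF are aggregated during maintenance.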
[Figure: for the cube over a, b, c, d, a few delta cuboids each refresh several cuboids; for example, Δacd refreshes c, ac, acd, Δabd refreshes d, ad, abd, and Δabcd refreshes a, ab, abc, abcd]

Key Idea (cont'd)

Benefit
  – The number of delta cuboids that need to be computed is reduced
  – Existing methods: Δ, Δa, Δb, Δc, Δd, Δab, Δac, Δad, Δbc, Δbd, Δcd, Δabc, Δabd, Δacd, Δbcd, Δabcd (2^n = 16 delta cuboids need to be computed)
  – Our method: Δbc, Δcd, Δabd, Δacd, Δbcd, Δabcd (only 6 delta cuboids need to be computed)

Key Idea (cont'd)

Refreshing more than one cuboid by a delta cuboid

  Δab                ab
  a b c SUM(m)       a b c SUM(m)
  1 1 * 19           1 1 * 9
  1 2 * 14           1 2 * 7

  Δabc               abc
  a b c SUM(m)       a b c SUM(m)
  1 1 3 4            1 1 2 3
  1 1 1 6            1 1 3 4
  1 1 2 9            1 1 1 2
  1 2 2 2            1 2 3 5
  1 2 1 4            1 2 1 2
  1 2 3 8            1 2 2 6

  – Example: the Δabc tuple (1, 1, 3, 4) adds 4 to tuple (1, 1, 3) of abc (4 → 8) and 4 to tuple (1, 1, *) of ab (9 → 13)

Key Idea (cont'd)

However, this method requires more accesses to ab
  – Refreshing ab by Δab takes |Δab| accesses to ab
  – Refreshing ab by Δabc takes |Δabc| accesses to ab, in addition to the |Δabc| accesses to abc
  – In the tables above, |Δab| = 2 and |Δabc| = 6

Key Idea (cont'd)

But if Δabc is sorted by a, b and c, ab can be refreshed by Δabc without more accesses to ab
  – While scanning the sorted Δabc, a running sum m_ab is kept for the current (a, b) group and applied to ab once per group

  Δabc sorted by a, b, c
  a b c SUM(m)  m_ab
  1 1 1 6       6
  1 1 2 9       15
  1 1 3 4       19
  1 2 1 4
  1 2 2 2
  1 2 3 8

  – Example: after the last tuple of group (1, 1), m_ab = 19 is added to tuple (1, 1, *) of ab (9 → 28)
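The one-pass refresh described above can be sketched in Python as follows. The tables use the values shown on the slides as far as they can be read; the function name refresh_from_sorted_delta and the ab_writes counter are assumptions added for illustration, and the code is a sketch of the idea rather than the authors' implementation.

```python
def refresh_from_sorted_delta(abc, ab, delta_abc):
    """Refresh cuboids abc and ab in one scan of delta_abc sorted by (a, b, c)."""
    abc, ab = dict(abc), dict(ab)
    ab_writes = 0                 # number of accesses to ab
    current, m_ab = None, 0       # current (a, b) group and its running sum
    for (a, b, c), m in sorted(delta_abc.items()):
        if current is not None and (a, b) != current:
            # The (a, b) group is complete: apply its sum to ab once.
            ab[current] = ab.get(current, 0) + m_ab
            ab_writes += 1
            m_ab = 0
        current = (a, b)
        abc[(a, b, c)] = abc.get((a, b, c), 0) + m
        m_ab += m
    if current is not None:       # flush the last group
        ab[current] = ab.get(current, 0) + m_ab
        ab_writes += 1
    return abc, ab, ab_writes


abc = {(1, 1, 2): 3, (1, 1, 3): 4, (1, 1, 1): 2,
       (1, 2, 3): 5, (1, 2, 1): 2, (1, 2, 2): 6}
ab = {(1, 1): 9, (1, 2): 7}
delta_abc = {(1, 1, 3): 4, (1, 1, 1): 6, (1, 1, 2): 9,
             (1, 2, 2): 2, (1, 2, 1): 4, (1, 2, 3): 8}

new_abc, new_ab, ab_writes = refresh_from_sorted_delta(abc, ab, delta_abc)
print(new_ab, ab_writes)
```

Because the delta is sorted, all tuples of one (a, b) group are adjacent, so ab is accessed |Δab| = 2 times here instead of |Δabc| = 6 times.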
Key Idea (cont'd)

A delta cuboid can be easily sorted during its computation with little or no additional cost
  – In most existing commercial relational database systems, aggregation algorithms are based on sorting [G93]
  – If a group-by is computed by a sorting algorithm, results sorted on the grouping attributes can be easily obtained
We assume that a delta cuboid is computed by sorting-based algorithms
  – Thus, the above method can be applied to any delta cuboid with no additional sorting cost

Generalization of the Idea

Δ[d1d2…dk]
  – A delta cuboid sorted in the order of attributes d1, d2, …, dk
  – The following set of cuboids can be refreshed by Δ[d1d2…dk]: {d1d2…dk, d1d2…dk-1, …, d1, ∅}, where ∅ denotes the cuboid of the empty group-by
Δ[d1d2…dk] → {q1, q2, …, qi}
  – Cuboids q1, q2, …, qi are refreshed by Δ[d1d2…dk]
Example
  – Δ[abcd] → {abcd, abc, ab, a, ∅}
      • Cuboids abcd, abc, ab, a, ∅ are refreshed by Δ[abcd]
  – Δ[acdb] → {acdb, acd, ac, a, ∅}
      • Cuboids acdb, acd, ac, a, ∅ are refreshed by Δ[acdb]

Our Approach

We propose an incremental maintenance method that can maintain a data cube with 2^n cuboids using only a subset of the 2^n delta cuboids

Three ways to refresh the cube over a, b, c:
  – 2^3 = 8 delta cuboids: Δ[abc] → {abc}, Δ[ab] → {ab}, Δ[ca] → {ca}, Δ[bc] → {bc}, Δ[a] → {a}, Δ[b] → {b}, Δ[c] → {c}, Δ → {∅}
  – 4 delta cuboids: Δ[acb] → {acb, ac}, Δ[ba] → {ba, b}, Δ[cb] → {cb, c}, Δ[a] → {a, ∅}
  – 3 delta cuboids: Δ[abc] → {abc, ab, a, ∅}, Δ[ca] → {ca, c}, Δ[bc] → {bc, b}

Cost of Computing Delta Cuboids

Different sets of delta cuboids incur different computation costs
We represent the cost of computing delta cuboids by the cost of a delta cuboid computation plan
  – Δ[abc], Δ[ab], Δ[ca], Δ[bc], Δ[a], Δ[b], Δ[c], Δ
  – Δ[acb], Δ[ba], Δ[cb], Δ[a]
  – Δ[abc], Δ[ca], Δ[bc]

[Figure: one delta cuboid computation plan per set above, drawn on the delta lattice of Δabc with the edge costs shown on the slide]
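The generalization can be checked mechanically: a delta cuboid sorted on d1…dk can refresh exactly the cuboids whose attribute sets are prefixes of that order. The Python sketch below tests whether a set of sort orders covers every cuboid of a cube. The names refreshable and covers_cube are assumptions for illustration; the three sets are the ones listed on the Our Approach slide.

```python
from itertools import combinations


def refreshable(order):
    """Cuboids that a delta cuboid sorted on `order` can refresh: its prefixes."""
    return {frozenset(order[:k]) for k in range(len(order) + 1)}


def covers_cube(orders, dims):
    """True if the sorted delta cuboids can refresh every cuboid over dims."""
    all_cuboids = {frozenset(c)
                   for k in range(len(dims) + 1)
                   for c in combinations(dims, k)}
    covered = set()
    for order in orders:
        covered |= refreshable(order)
    return all_cuboids <= covered


eight = ["abc", "ab", "ca", "bc", "a", "b", "c", ""]   # one delta per cuboid
four = ["acb", "ba", "cb", "a"]
three = ["abc", "ca", "bc"]

for orders in (eight, four, three):
    print(len(orders), "delta cuboids:", covers_cube(orders, "abc"))
```

Dropping Δ[ca] from the last set leaves the cuboids c and ac without a delta cuboid that can refresh them, so three is the smallest of these sets that still works.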
Problem Formulation

Delta cube
  – ΔQ = {Δq1, Δq2, …, Δqm}, where Δqi is a delta cuboid
Refresh chain
  – A sequence of elements <Δq1, Δq2, …, Δqi> in ΔQ such that q1 ⊃ q2 ⊃ … ⊃ qi
  – <Δq1, Δq2, …, Δqi> implies Δ[q1] → {q1, q2, …, qi}
  – Example
      • <Δabc, Δab, Δa> implies Δ[abc] → {abc, ab, a}
      • <Δcba, Δcb, Δc> implies Δ[cba] → {cba, cb, c}

Problem Formulation (cont'd)

Refresh partition
  – A partition of the elements of ΔQ into disjoint refresh chains
  – Example
      • {<Δacb, Δac>, <Δab, Δb>, <Δbc, Δc>, <Δa, Δ>} implies Δ[acb] → {acb, ac}, Δ[ba] → {ba, b}, Δ[cb] → {cb, c}, Δ[a] → {a, ∅}
      • {<Δabc, Δab, Δa, Δ>, <Δac, Δc>, <Δbc, Δb>} implies Δ[abc] → {abc, ab, a, ∅}, Δ[ca] → {ca, c}, Δ[bc] → {bc, b}

Problem Formulation (cont'd)

Delta cuboid computation plan
  – A subtree of the delta lattice including at least all of the first elements of the refresh chains in a given refresh partition
  – Example
      • Refresh partition: {<Δabc, Δab, Δa, Δ>, <Δca, Δc>, <Δbc, Δb>}
      • Delta cuboid computation plan: the subtree of the delta lattice that contains Δabc, Δca and Δbc

[Figure: the delta lattice of Δabc with the computation plan highlighted; the two plan edges carry costs 15 and 14 on the slide]

Problem Statement

Given a delta cube and its delta lattice, find a refresh partition that minimizes the cost of a delta cuboid computation plan
  – Input delta cube: {Δabc, Δab, Δac, Δbc, Δa, Δb, Δc, Δ}
  – Refresh partition to be found: {<Δabc, Δab, Δa, Δ>, <Δca, Δc>, <Δbc, Δb>}
  – Resulting delta cuboid computation plan: Δabc, Δca, Δbc (edge costs 15 and 14 on the slide)
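The definitions above can be turned into a small Python check. The sketch assumes that cuboids are written as strings of single-letter attributes and that a refresh chain lists strictly nested cuboids from largest to smallest; the names sort_order and is_refresh_partition are illustrative, not from the talk.

```python
from itertools import combinations


def sort_order(chain):
    """Sort order for the first delta cuboid of a chain, e.g. ['ac', 'c'] -> 'ca'."""
    order = ""
    for q in reversed(chain):     # smallest cuboid first
        # Append the attributes that this cuboid adds to the order so far.
        order += "".join(sorted(set(q) - set(order)))
    return order


def is_refresh_partition(partition, dims):
    """True if the chains are nested, disjoint and cover every cuboid over dims."""
    seen = []
    for chain in partition:
        sets = [frozenset(q) for q in chain]
        # Each cuboid must strictly contain the next one in its chain.
        if any(not (x > y) for x, y in zip(sets, sets[1:])):
            return False
        seen.extend(sets)
    everything = {frozenset(c)
                  for k in range(len(dims) + 1)
                  for c in combinations(dims, k)}
    return len(seen) == len(everything) and set(seen) == everything


partition = [["abc", "ab", "a", ""], ["ac", "c"], ["bc", "b"]]
print(is_refresh_partition(partition, "abc"))
print([sort_order(chain) for chain in partition])
```

For the refresh partition of the Problem Statement slide, the derived sort orders correspond to Δ[abc], Δ[ca] and Δ[bc]. Attributes added at the same step may be ordered arbitrarily; the sketch sorts them alphabetically only to be deterministic.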
NP-Hardness of the Problem

For a given refresh partition, finding the minimum cost delta cuboid computation plan is NP-complete (proved in the paper)
Our problem is NP-hard
  – Our problem is at least as hard as finding the minimum cost delta cuboid computation plan
Moreover, there can be many refresh partitions for a given delta cube
Hence, we resort to heuristic approaches

Idea behind Our Heuristic

As the number of delta cuboids to be computed increases, the cost of a delta cuboid computation plan increases
Hence, we reduce the number of refresh chains in a refresh partition as far as possible
  – Example: {Δab}, {Δa}, {Δb}, {Δ} (4 chains) → {Δab, Δa}, {Δb}, {Δ} (3 chains) → {Δab, Δa}, {Δb, Δ} (2 chains)
The minimum number of refresh chains in a refresh partition for a delta cube with 2^n delta cuboids is the binomial coefficient C(n, ⌊n/2⌋) (proved in the paper)

Heuristic Algorithm

① Start from the refresh partition with 2^n refresh chains
  – Each refresh chain consists of only one delta cuboid
② Repeatedly merge refresh chains until there are exactly C(n, ⌊n/2⌋) refresh chains in the refresh partition
  – Whenever refresh chains are merged, a new delta cuboid computation plan with less cost is produced

[Figure: for n = 3, the refresh partition shrinks from 2^n = 8 chains {Δ}, {Δa}, {Δb}, {Δc}, {Δab}, {Δca}, {Δbc}, {Δabc}, through {Δb, Δ}, {Δa, Δab}, {Δc}, {Δca}, {Δbc}, {Δabc}, to C(3, 1) = 3 chains {Δca, Δc}, {Δbc, Δb, Δ}, {Δabc, Δab, Δa}]

Example of the Algorithm

[Figure: five panels over the delta lattice with levels Lv(0): Δ; Lv(1): Δa, Δb, Δc; Lv(2): Δab, Δca, Δbc; Lv(3): Δabc, and the edge costs shown on the slide.
(1) Input: the delta lattice.
(2) Step 1: {Δ}, {Δa}, {Δb}, {Δc}, {Δab}, {Δca}, {Δbc}, {Δabc}.
(3) Step 2: {Δb, Δ}, {Δa}, {Δc}, {Δab}, {Δca}, {Δbc}, {Δabc}.
(4) Step 3: {Δab, Δa}, {Δca, Δc}, {Δbc, Δb, Δ}, {Δabc}.
(5) Step 4: {Δca, Δc}, {Δbc, Δb, Δ}, {Δabc, Δab, Δa}]
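The slides do not spell out how the heuristic picks which chains to merge, so the following Python sketch does not reproduce it. It only illustrates that a refresh partition with C(n, ⌊n/2⌋) chains always exists, using a different, cost-oblivious construction: the classical symmetric chain decomposition by bracket matching. The function names unmatched and chain_partition are assumptions.

```python
from math import comb


def unmatched(bits):
    """Match each 0 as '(' with a later 1 as ')'; return unmatched 0s and 1s."""
    stack, ones = [], []
    for i, bit in enumerate(bits):
        if bit == 0:
            stack.append(i)
        elif stack:
            stack.pop()           # this 1 closes the most recent open 0
        else:
            ones.append(i)        # no open 0 left: unmatched 1
    return stack, ones            # the stack now holds the unmatched 0s


def chain_partition(dims):
    """Partition all cuboids over dims into nested chains, largest cuboid first."""
    n = len(dims)
    chains = []
    for code in range(2 ** n):
        bits = [(code >> (n - 1 - i)) & 1 for i in range(n)]
        if unmatched(bits)[1]:
            continue              # has an unmatched 1: not the bottom of a chain
        chain = []
        while True:
            chain.append("".join(d for d, bit in zip(dims, bits) if bit))
            zeros, _ = unmatched(bits)
            if not zeros:
                break
            bits[zeros[0]] = 1    # grow by adding the leftmost unmatched attribute
        chains.append(chain[::-1])
    return chains


for n in range(1, 7):
    chains = chain_partition("abcdef"[:n])
    print(n, 2 ** n, len(chains), comb(n, n // 2))
print(chain_partition("abc"))
```

For n = 3 this yields three chains headed by abc, ac and bc, the same chain count as in the example above, although the chains themselves differ because no costs are taken into account.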
Example of the Algorithm (cont'd)

(6) Output: {{Δca, Δc}, {Δbc, Δb, Δ}, {Δabc, Δab, Δa}}

Analysis of the Algorithm

Lemma 1: Given a delta cube with 2^n delta cuboids, the proposed heuristic algorithm produces a refresh partition with exactly C(n, ⌊n/2⌋) refresh chains, where C denotes the binomial coefficient
Thus, we need to compute only C(n, ⌊n/2⌋) delta cuboids to refresh a data cube with 2^n cuboids

  n     2^n    C(n, ⌊n/2⌋)
  2     4      2
  3     8      3
  …     …      …
  8     256    70
  9     512    126
  10    1024   252

Analysis of the Algorithm (cont'd)

Lemma 2: Let T_C be a delta cuboid computation plan found by the proposed heuristic. Then the following is true.

  Cost(T_C) < Cost(T_G/2) < Cost(T_G)

  – G: a delta lattice with 2^n delta cuboids
  – G/2: a subgraph of G such that Level(0), Level(1), …, Level(n/2) are removed from G
  – T_G: the minimum spanning tree of G
  – T_G/2: the minimum spanning tree of G/2
Thus, the cost of T_C is bounded by the cost of T_G/2

Performance Evaluation (1/3)

Data warehouse environment
  – Oracle9i database
  – Sun Blade 1000 with UltraSparc III CPU and 512MB RAM
  – TPC-H benchmark schema and data
Data cubes used in the experiments
  – Defined over the lineitem table in the TPC-H schema

  Data Cube   Dimension Attributes                                             Measure
  Q1          l_orderkey, l_partkey, l_suppkey                                 l_quantity
  Q2          l_orderkey, l_partkey, l_suppkey, l_shipdate                     l_quantity
  Q3          l_orderkey, l_partkey, l_suppkey, l_shipdate, l_receiptdate      l_quantity

Performance Evaluation (2/3)

By varying the size of changes

[Figure: maintenance time in seconds versus delta size (2% to 20%), split into propagate and refresh stages, for the previous method and ours; panels (a) Q1, (b) Q2, (c) Q3]
Performance Evaluation (3/3)

The number of tuples generated in the experiment

[Figure: number of tuples versus delta size (2% to 20%) for the previous method and ours; panels (a) Q1, (b) Q2, (c) Q3]

Summary

We proposed an efficient incremental maintenance method for data cubes
The proposed method can refresh a data cube with 2^n cuboids using only C(n, ⌊n/2⌋) delta cuboids, where C denotes the binomial coefficient
  – The cost of computing delta cuboids can be substantially reduced
We formulated the problem and developed a heuristic algorithm for it
We showed the efficiency of the proposed method through performance evaluation

References

[GBP96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. In Proceedings of the ICDE Conference, p. 152-159, 1996.
[HRU96] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing Data Cubes Efficiently. In Proceedings of the ACM SIGMOD Conference, p. 205-216, 1996.
[G93] G. Graefe. Query Evaluation Techniques for Large Databases. ACM Computing Surveys, Vol. 25, Issue 2, p. 73-169, 1993.
[FG82] L. R. Foulds and R. L. Graham. The Steiner Problem in Phylogeny is NP-Complete.
Advances in Applied Mathematics, 3: 43-49, 1982.