PrefixCube: Prefix-sharing Condensed Data Cube

Jianlin Feng, Qiong Fang, Hulin Ding
Huazhong Univ. of Sci. & Tech.
fengjl@mail.hust.edu.cn
Nov 12, 2004
DOLAP 2004

Outline
– Introduction
– Related Work
– ODM: Ordered Datacube Model
– BST-Condensed Cube
– Prefix-sharing Condensed Cube
– Comparisons
– Conclusions

Introduction
Data Cube (ICDE'96)
– N-dimensional cube(A1, A2, …, AN)
– 2^N cuboids, i.e. GROUP-BYs
The Huge Size Problem
– When the source relation R is sparse, the size of a cuboid can be close to the size of R.
– The I/O cost of merely storing the cube result tuples becomes dominant.

Related Work
– Condensed Cube (ICDE'02)
– Dwarf (SIGMOD'02)
– Quotient Cube (VLDB'02)
– QC-Tree (SIGMOD'03)
Basic idea: remove the redundancies that exist among cube tuples.
– prefix redundancy
– suffix redundancy

Prefix Redundancy
Given an example cube(A, B, C):
– Each value of dimension A occurs in 4 cuboids: cuboid(A), (AB), (AC) and (ABC).
– It may occur many times in each of these cuboids except cuboid(A).
Hence there is both inter-cuboid and intra-cuboid prefix redundancy.

Suffix Redundancy
Occurs when cube tuples belonging to different cuboids are actually aggregated from the same group of base relation tuples.
An extreme case:
– Let the source relation R have only a single tuple r(a1, a2, …, an, m).
– The 2^n cube tuples can be condensed into one physical tuple (a1, a2, …, an, V), where V = aggr(r),
– together with some information indicating that it is a representative tuple.

Thinking…
Condensed cube
– It condenses the cube tuples that are aggregated from one single base tuple into one physical tuple in order to reduce the cube's size.
Dwarf
– Besides suffix coalescing, i.e. multi-base-tuple condensing, it also realizes full prefix-sharing, and so achieves a highly effective reduction of the cube's size.
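The extreme case of suffix redundancy (a source relation with a single tuple) can be made concrete with a small sketch. The function names `cube_of_single_tuple` and `condense` are hypothetical, and the sketch assumes an aggregate such as SUM, MIN or MAX, for which aggr(r) of a single tuple is simply its measure m.

```python
from itertools import product

def cube_of_single_tuple(dims, measure):
    """Enumerate all 2^n cube tuples of a relation holding one tuple.

    Each dimension either keeps its value or is replaced by '*' (ALL).
    Assumption: the aggregate of a single tuple equals its measure.
    """
    choices = [(value, "*") for value in dims]
    return [combo + (measure,) for combo in product(*choices)]

def condense(dims, measure):
    """Keep one physical tuple plus a flag marking it as representative."""
    return {"tuple": tuple(dims) + (measure,), "representative": True}

cube = cube_of_single_tuple(("a1", "a2", "a3"), 5)
physical = condense(("a1", "a2", "a3"), 5)
```

For n = 3 the full cube holds 8 tuples, all carrying the same measure, while the condensed form stores a single flagged tuple.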

Motivation
How can the condensed cube's size be reduced further, while taking into account the characteristics of the queries we intend to answer, namely range queries?
Augment BST-condensing with the removal of intra-cuboid prefix redundancy!

Ordered Datacube Model
– The value ALL (or *) is encoded as 0.
– For a dimension D with cardinality C, each dimension value is mapped one-to-one to an integer between 1 and C inclusive.
– N dimensions form an N-dimensional space.
– The origin O(0, 0, …, 0) represents the grand total.

Ordered Datacube Model
Under ODM, a range query against a data cube can be reduced to a sub-query against only one particular cuboid in the cube, or to a union of such sub-queries.

BST-Condensed Cube
Base Single Tuple (BST)

      A   B   C   M
t1    8   1   1   100
t2    1   8   1   50
t3    1   2   3   60

– t1 is a BST on SD {A} and {B}
– t2 is a BST on SD {B}
A unique minimal BST-condensed cube, MinCube, is obtained when every BST is fully exploited on all of its SDs.

BU-BST Condensed Cube
BottomUpBST algorithm (ICDE'02)
– Each BST corresponds to only one SD.
– Compared with MinCube, it is easier to compute, and normal cube tuples are easier to restore from the condensed cube.
Note: BST condensing is a special kind of prefix-sharing!

      A   B   C   M
      8   *   *   10
      8   1   *   10
      8   *   1   10
      8   1   1   10

A group of cube tuples sharing a prefix is represented by a BST!
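The BST definition and the group of cube tuples shown above can be checked with a short sketch. It is an illustration under assumptions, not the BottomUpBST algorithm: `is_bst_on` and `expand_bst` are hypothetical names, dimensions are addressed by position (A = 0, B = 1, C = 2), and ALL is written as 0 following ODM.

```python
from itertools import combinations

ALL = 0  # ODM encodes the value ALL (*) as 0

# Base relation from the BST-Condensed Cube slide: (A, B, C, M)
BASE = {
    "t1": (8, 1, 1, 100),
    "t2": (1, 8, 1, 50),
    "t3": (1, 2, 3, 60),
}

def is_bst_on(name, sd, base=BASE):
    """True if `name` is the only base tuple with its values on dimensions `sd`."""
    key = tuple(base[name][d] for d in sd)
    matches = [n for n, row in base.items()
               if tuple(row[d] for d in sd) == key]
    return matches == [name]

def expand_bst(dims, measure, sd):
    """Cube tuples represented by a BST on `sd`.

    The SD values are kept; every subset of the remaining
    dimensions is replaced by ALL.
    """
    free = [d for d in range(len(dims)) if d not in sd]
    result = []
    for k in range(len(free) + 1):
        for hidden in combinations(free, k):
            row = tuple(ALL if d in hidden else v for d, v in enumerate(dims))
            result.append(row + (measure,))
    return result

# The slide's group: tuple (8, 1, 1) with M = 10 and SD {A}
group = expand_bst((8, 1, 1), 10, sd=(0,))
```

With SD {A}, the tuple (8, 1, 1) expands to the four cube tuples in the table, which is why one physical tuple plus its SD is enough to restore them.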
The BST that represents the group (ct7):

      A   B   C   M    SD
ct7   8   1   1   10   {A}

A BU-BST Condensed Cube Example
Base relation:

      A   B   C   M
t1    8   1   1   100
t2    1   8   1   50
t3    1   2   3   60

BU-BST condensed cube:

        A   B   C   M     SID/CID
ct1     *   *   *   210   ALL
ct2     1   *   *   110   A
ct3     1   2   3   60    AB
ct4     1   8   1   50    AB
ct5     1   *   1   50    AC
ct6     1   *   3   60    AC
ct7     8   1   1   100   A
ct8     *   1   1   100   B
ct9     *   2   3   60    B
ct10    *   8   1   50    B
ct11    *   *   1   150   C
ct12    *   *   3   60    C

Note:
– Intra-cuboid prefix redundancy: ct3 and ct4
– Inter-cuboid prefix redundancy: ct2, ct3 and ct5

Prefix-sharing Condensed Cube - PrefixCube
Prefix-sharing: BST condensing + intra-cuboid prefix-sharing → PrefixCube

A PrefixCube Example
[Figure: the PrefixCube for the example above, with N-Roots and V-Roots and sub-structures labeled CID = ALL, CID = A, CID = AC, SID = A, SID = AB and SID = B. The node layout did not survive extraction.]

Corresponding Dwarf
[Figure: the Dwarf for the same example, with node1 and node2 on the A dimension, node3 on the B dimension and node4 on the C dimension. The node layout did not survive extraction.]

PrefixCube vs. Dwarf

                              PrefixCube        Dwarf
Prefix-sharing                Intra-cuboid      Inter- and intra-cuboid
Suffix coalescing             BST condensing    Multi-tuple condensing
Compression ratio             Lower             Higher
Saving extra value ALL?       No                Yes
Tuples clustered by cuboid?   Yes               No

PrefixCube does not blindly pursue the highest compression ratio. It is intended to strike a good compromise among the cube size reduction ratio, restoring and updating costs, and query characteristics!
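Intra-cuboid prefix-sharing can be illustrated on a few tuples of the BU-BST condensed cube example. This is a minimal sketch using nested dictionaries, not the storage layout of PrefixCube itself: `build_tries` and `count_cells` are hypothetical names, and only ct3, ct4 and ct7 are used so that no assumption is needed about how ALL values are stored.

```python
from collections import defaultdict

# (SID/CID tag, (A, B, C), M) for ct3, ct4 and ct7 of the example
EXAMPLE = [
    ("AB", (1, 2, 3), 60),   # ct3
    ("AB", (1, 8, 1), 50),   # ct4
    ("A",  (8, 1, 1), 100),  # ct7
]

def build_tries(cube_tuples):
    """Group tuples by tag and merge equal leading values within each group."""
    tries = defaultdict(dict)
    for tag, dims, measure in cube_tuples:
        node = tries[tag]
        for value in dims[:-1]:
            node = node.setdefault(value, {})
        node[dims[-1]] = measure  # leaf: last dimension value -> measure
    return dict(tries)

def count_cells(node):
    """Number of dimension-value cells stored in a trie."""
    if not isinstance(node, dict):
        return 0  # a measure, not a dimension cell
    return sum(1 + count_cells(child) for child in node.values())

tries = build_tries(EXAMPLE)
flat_cells = sum(len(dims) for _, dims, _ in EXAMPLE)
shared_cells = sum(count_cells(t) for t in tries.values())
```

In the AB group, ct3 and ct4 share the prefix A = 1, so the group needs 5 cells instead of 6. Sharing is applied only inside a group, so ct7 stays separate and tuples remain clustered by their SID/CID tag.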

Effectiveness of Size Reduction
Datasets
– synthetic datasets with uniform distribution
– # of tuples: 1,000,000
[Figure: size ratio vs. number of dimensions (2 to 9) for BU-BST and PrefixCube; (a) Cardinality = 100, (b) Cardinality = 1000]

Effectiveness of Size Reduction
PrefixBUC
– Full cube (computed by BUC)
– Prefix-sharing
[Figure: size ratio vs. number of dimensions (2 to 9) for C = 100 and C = 1000]

Impact of Data Density
Datasets
– Uniform distribution
– # of dimensions: 6
– Cardinality of dimensions: 100
– # of tuples: from 1,000 to 1,000,000
[Figure: size ratio vs. number of tuples (1.E+03 to 1.E+06) for BU-BST, PrefixCube and PrefixBUC]

Impact of Data Skewness
Datasets
– Zipf distribution
– # of tuples: 1,000,000
– Cardinality of dimensions: from 1,000 down to 500 in steps of 100
– Zipf factor: from 0 to 0.8 in steps of 0.2
[Figure: size ratio vs. Zipf factor (0 to 0.8) for BU-BST, PrefixCube and PrefixBUC]

Real-world Dataset
Datasets
– Weather datasets
– # of tuples: 1,015,367
[Figure: size ratio vs. number of dimensions (2 to 9) for BU-BST, PrefixCube and PrefixBUC]
[Figure: time (sec.) vs. number of dimensions (2 to 9) for BUC, BU-BST and PrefixCube]

Conclusion
A new cube structure, PrefixCube, was proposed by augmenting BU-BST condensing with intra-cuboid prefix-sharing.
– It greatly reduces the data cube's size compared with the BU-BST condensed cube.
– It also reduces the impact of data skew on BU-BST condensing.
– It achieves a fairly stable size reduction on both dense and sparse datasets.

The End
Thank you! Any questions?