Efficient Methods for Data Cube Computation and Data Generalization
Chapter 4 (4.1)
March 14, 2016

Data Generalization
Data generalization is the process of abstracting conceptual-level knowledge from a large set of task-relevant data.
Two types of analysis:
• Descriptive data mining: describes the data concisely, highlighting its interesting general properties.
• Predictive data mining: constructs a model and uses it to predict the behavior of new data (classification, regression, ...).

A Data Cube (MOLAP)
• On-line analytical processing is fastest when the aggregates for all the cuboids are precomputed.
• Precomputing the full cube requires an excessive amount of memory; the cost grows with the number of dimensions and the cardinality of each dimension.
• For many cells in a cuboid the measure value is zero, and such cells are of little or no interest. Cuboids are often sparse.

Partial Materialization
Precomputing some of the cuboids in advance gives fast response times and avoids redundant computation during on-line analytical processing.
Data cube materialization (precomputation):
• No materialization: do not precompute any non-base cuboid. Every multidimensional aggregation is computed on the fly, which is slow.
• Full materialization: precompute all the cuboids. Queries run very fast, but the memory requirement is huge.
• Partial materialization: selectively compute a proper subset of the cuboids, or a subset of the cube that contains only those cells satisfying some user-specified criterion.
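
To make the cost of full materialization concrete, here is a minimal Python sketch. It assumes a cube without concept hierarchies, so an n-dimensional cube has 2^n cuboids, and each dimension of a cell holds either one of its values or the aggregate marker "*". The function names and the cardinalities are hypothetical, chosen only for illustration.

```python
from math import prod


def cuboid_count(num_dimensions: int) -> int:
    """Number of cuboids (group-bys) in a cube with no concept hierarchies."""
    return 2 ** num_dimensions


def max_cell_count(cardinalities: list[int]) -> int:
    """Upper bound on the number of cells in the full cube.

    Each dimension contributes its distinct values plus one slot for '*'.
    """
    return prod(c + 1 for c in cardinalities)


# Hypothetical cardinalities for three dimensions (branch, product, year).
cardinalities = [4, 6, 3]
print(cuboid_count(len(cardinalities)))  # 2**3 = 8
print(max_cell_count(cardinalities))     # 5 * 7 * 4 = 140
```

Adding one more dimension doubles the number of cuboids, which is why the slides treat partial materialization as the practical choice once the dimension count grows.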

Outline
• Types of cells: base cell, aggregate cell, cell relationships
• Types of cubes: full cube, iceberg cube, closed cube, shell cube
• Efficient computation of data cubes
  • Multiway array aggregation
  • BUC
  • Star-Cubing

A Data Cube: sales
Branch       I1   I2   I3   I4   I5   I6   All
New York     10   11   12    3   10    1    47
Chicago      11    9    6    9    6    7    48
Toronto      12    9    8    5    7    3    44
Vancouver    13    8   10    5    6    3    45
All          46   37   36   22   29   14   184
• Base cuboid: the branch × product cells (I1 to I6 for each branch); each of these is a base cell.
• Branch cuboid: the All column (one total per branch).
• Product cuboid: the All row (one total per product).
• Apex cuboid: the grand total, 184.
• The cells of the branch, product, and apex cuboids are aggregate cells.

- Types of cells
• Base cell: a cell that belongs to the base cuboid.
• Aggregate cell: a cell that belongs to a non-base cuboid. Each aggregated dimension is indicated by a "*".
• Ancestor-descendant relationship between cells. With dimensions (branch, product, year), the 1-D cell c1 = (New York, *, *, 2000) is an ancestor of the 2-D cell c2 = (New York, I1, *, 400) and of the 3-D cell c3 = (New York, I1, 2013, 111); c3 is a descendant of both c1 and c2.
• In an n-D data cube, an i-D cell a = (a1, a2, ..., an, measure_a) is an ancestor of a j-D cell b = (b1, b2, ..., bn, measure_b) if
  1) i < j, and
  2) for 1 ≤ m ≤ n, am = bm whenever am ≠ "*".
  3) If in addition j = i + 1, a is called a parent of b, and b is a child of a.

- Types of cubes
• Full cube: all cells of all cuboids are materialized, for every possible combination of dimensions and values. An n-dimensional cube has $2^n$ cuboids, or $\prod_{i=1}^{n} (L_i + 1)$ cuboids when dimension i has $L_i$ hierarchy levels.
• Iceberg cube: partial materialization. Materialize only the cells of a cuboid whose measure value meets a minimum threshold. Iceberg condition: count(*) >= min support.
• Closed cube: an ancestor cell is not created if its measure is equal to that of one of its descendant cells.
• Shell cube: only cuboids with a limited number of dimensions are created.

Closed cube example
• Two base cells: {(a1, a2, a3, ..., a100):10, (a1, a2, b3, ..., b100):10}
• How many sub-patterns (aggregate cells) does the first base cell have? $2^{100} - 1$.
• The total number of distinct aggregate cells is $2^{101} - 6$, because four aggregate cells are shared by the two base cells.
• Ignore all of the aggregate cells that can be obtained by replacing some constants by "*" while keeping the same measure value.
• Only 3 cells really offer new information.
• The three cells: {(a1, a2, a3, ..., a100):10, (a1, a2, b3, ..., b100):10, (a1, a2, *, ..., *):20}

Example
Which are the closed cells?
[Slide figure not recoverable: it shows how the remaining cells can be derived from the closed cells.]

Iceberg Cube, Closed Cube & Cube Shell
• Is the iceberg cube good enough?
  • 2 base cells: {(a1, a2, a3, ..., a100):10, (a1, a2, b3, ..., b100):10}
  • How many cells will the iceberg cube have with the condition count(*) >= 10? Hint: a huge but tricky number!
• Closed cube
  • Closed cell c: a cell for which there exists no cell d such that d is a descendant of c and d has the same measure value as c.
  • Closed cube: a cube consisting of only closed cells.
  • What is the closed cube of the above base cuboid? Hint: only 3 cells.
• Cube shell
  • Precompute only the cuboids involving a small number of dimensions, e.g., 3.
  • For (A1, A2, ..., A10), how many combinations must be computed?
  • Dimension combinations beyond the shell have to be computed on the fly.

- Efficient Computation of Data Cubes
• Preliminary cube computation tricks
• Computing full/iceberg cubes: 2 methodologies
  • Top-down: multiway array aggregation
  • Bottom-up: bottom-up computation (BUC)
• Star-Cubing: integrates top-down and bottom-up

-- Preliminary Cube Computation Tricks
• Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples (ROLAP).
• Aggregates may be computed from previously computed aggregates, rather than from the base fact table.
• Cache-results: keep the results of already computed cuboids and reuse them to reduce disk I/O. Higher-level aggregates are computed from lower-level aggregates rather than from the base facts.
• Smallest-child: compute a cuboid from the smallest previously computed cuboid.
  Example: C{branch} can be computed from either C{branch, year} or C{branch, item}; use whichever of the two is smaller.
• Amortize-scans: compute as many cuboids as possible at the same time to amortize disk reads.
• Share-sorts: share sorting costs across multiple cuboids when a sort-based method is used.
• Share-partitions: share the partitioning cost across multiple cuboids when hash-based algorithms are used.

-- Multi-Way Array Aggregation
• Used for MOLAP and full cube computation
• Array-based "bottom-up" algorithm
• Uses multi-dimensional chunks
• Simultaneous aggregation on multiple dimensions
• Intermediate aggregate values are reused for computing ancestor cuboids
• Cannot do Apriori pruning: no iceberg optimization
[Figure: cuboid lattice with levels all; A, B, C; AB, AC, BC; ABC]

• Partition the array into chunks (a chunk is a small subcube that fits in memory).
• Compressed sparse array addressing: (chunk_id, offset)
• Compute aggregates in "multiway" fashion by visiting cube cells in the order that minimizes the number of times each cell is visited, which reduces memory access and storage cost.
[Figure: 3-D array over A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]
• What is the best traversal order for multiway aggregation?
• AB requires the longest scan: its first chunk is complete only after chunk 49 has been scanned.
• Assume the sizes of dimensions A, B, and C are 40, 400, and 4000, respectively.
• Therefore AB is the smallest 2-D plane (40 × 400 = 16,000 cells) and BC is the largest (400 × 4000 = 1,600,000 cells); AC lies in between (40 × 4000 = 160,000 cells).
• Each dimension is split into 4 equal partitions, so one chunk is 10 × 100 × 1000.
• If chunks are scanned in the order 1, 2, 3, ..., then 156,000 memory units are needed: 40*400 (whole AB plane) + 40*1000 (one row of the AC plane) + 100*1000 (one chunk of the BC plane).
• If chunks are scanned in the order 1, 17, 33, 49, 5, 21, 37, ..., then 1,641,000 memory units are needed (aggregation ordering AB-AC-BC): 400*4000 (whole BC plane) + 40*1000 (one row of the AC plane) + 10*100 (one chunk of the AB plane).
[Figure: the cuboid lattice (all; A, B, C; AB, AC, BC; ABC) drawn once per scan order. The first ordering needs 156,000 memory units; the second needs 1,641,000 memory units.]

• Method: the planes should be sorted and computed according to their size in ascending order.
• Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane.
• Limitation: the method works well only for a small number of dimensions.
• If there are a large number of dimensions, "top-down" computation and iceberg cube computation methods can be explored.

-- Bottom-Up Computation (BUC)
• Bottom-up cube computation (Note: top-down in our view!)
• Divides dimensions into partitions and facilitates iceberg pruning.
• If a partition does not satisfy min_sup, its descendants can be pruned.
• If min_sup = 1, BUC computes the full cube.
• No simultaneous aggregation.
[Figure: cuboid lattice over dimensions A, B, C, D: all; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD]
Processing order: 1 all, 2 A, 3 AB, 4 ABC, 5 ABCD, 6 ABD, 7 AC, 8 ACD, 9 AD, 10 B, 11 BC, 12 BCD, 13 BD, 14 C, 15 CD, 16 D

BUC: Partitioning
• Usually the entire data set cannot fit in main memory.
• Sort the distinct values and partition them into blocks that fit.
• Continue processing.
• Optimizations
  • Partitioning: external sorting, hashing, counting sort.
  • Ordering dimensions to encourage pruning: cardinality, skew, correlation. The higher the cardinality, the smaller the partitions, and the greater the pruning opportunity.
  • Collapsing duplicates: holistic aggregates can no longer be computed!
• Ideally, the most discriminating dimension, with higher cardinality and less skew, is processed first.
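
The BUC recursion described above can be sketched in a few lines of Python. This is a minimal in-memory sketch, not the full algorithm: it assumes count(*) is the measure, that the data fits in memory (so the external sorting and partitioning optimizations are skipped), and that dimensions are processed in the order given. The function name `buc`, its parameters, and the sample rows are hypothetical.

```python
from collections import defaultdict


def buc(rows, dims, min_sup, cell=None, start=0, out=None):
    """Recursive BUC sketch using count as the measure.

    rows    : list of tuples, one value per dimension
    dims    : number of dimensions
    min_sup : iceberg threshold; keep cells with count >= min_sup
    cell    : current group-by cell; '*' marks an aggregated dimension
    start   : first dimension that may still be expanded
    out     : dict mapping cell tuple -> count
    """
    if cell is None:
        cell = ["*"] * dims
    if out is None:
        out = {}
    # Iceberg pruning: a partition below min_sup cannot have a descendant
    # that reaches min_sup, so the recursion stops here.
    if len(rows) < min_sup:
        return out
    out[tuple(cell)] = len(rows)
    # Expand only dimensions to the right of the last fixed one,
    # so each cuboid is visited exactly once.
    for d in range(start, dims):
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[d]].append(row)
        for value, part in partitions.items():
            cell[d] = value
            buc(part, dims, min_sup, cell, d + 1, out)
        cell[d] = "*"  # restore before moving to the next dimension
    return out


# Hypothetical fact table with dimensions (A, B, C).
rows = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c2"),
    ("a1", "b2", "c1"),
    ("a2", "b1", "c1"),
]
iceberg = buc(rows, dims=3, min_sup=2)
for cell, count in iceberg.items():
    print(cell, count)
```

With min_sup=2 the partition for a2 holds a single row, so it is pruned and none of its descendant cells are examined. Setting min_sup=1 disables pruning and yields the full cube.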

--- BUC: Example (Having count(*) > 5)
Dimensions: branch (B), product (P), quarter (Q).

New York   Q1   Q2   Q3   Q4
I1          5    1    0    1
I2          1    0    8    9
I3          1    1    1    8

Toronto    Q1   Q2   Q3   Q4
I1          3    1    1    2
I2          8    1    2    1
I3          2    1   11    8

All: 77

Q          Q1   Q2   Q3   Q4
count      20    5   23   29

Q,P        Q1   Q2   Q3   Q4
I1          8    2    1    3
I2          9    1   10   10
I3          3    2   12   16

[Figure: BUC processing tree over the cuboids All, Q, P, B, (Q,P), (Q,B), (P,B), and (B,P,Q), with nodes numbered in processing order]
Q2 has count 5, which fails count(*) > 5, so the Q2 partition is pruned and none of its descendant cells are computed.

Till Now
• Aggregates simultaneously on multiple dimensions.
• Multiple cuboids can be computed simultaneously in one pass.
• Dynamic structure with simultaneous aggregation.
• Facilitates a-priori pruning.
• During partitioning, each partition's count is compared with min_sup. The recursion stops if the count does not satisfy min_sup.

Summary
• Data cube materialization
  • Full materialization
  • Partial materialization: iceberg cubes, shell fragments
• Data cube computation methods
  • Multiway array aggregation
  • BUC for computing iceberg cubes

Next Class
• Star-Cubing
• Shell Fragments for Fast High-Dimensional OLAP
• Exploration and Discovery in Multidimensional Databases