Multi-way Algorithm for Cube Computation
CPS 196.03, Notes 8

First Programming Project
  Individual project, 15 points in final grade
  Sales(customer_id, item_id, item_group, item_price, purchase_date)
    Will be provided as a file during the demo and for generating performance numbers for the project report
  Task 1: 5 points
    Interface to enter MIN_SUPPORT (% of customers)
    Find frequent itemsets (sets of item_id’s) using Apriori
  Task 2: 5 points (Section 5.5 in the textbook)
    Interface to enter two constraint types (e.g., SUM(item_price) op const)
    Use the constraints in Apriori as effectively as possible; study and demonstrate the performance improvement
  Task 3: 5 points
    Extension of your choice. Examples include (i) association rules, (ii) complex constraints, (iii) sequential patterns, (iv) variants of Apriori, (v) FP-growth

File Format
  10,123,3,54,4/4/2008
  10,12,4,101,4/5/2008
  14,123,3,54,8/4/2008
  …
  Caveats:
    Customer Vs. Item
    Three datasets: Toy, Medium, and Large
    Comma-separated file, one purchase per line in file, no header in file
    Integers for simplicity
    Note date format

First Programming Project: Milestones
  Feb 3: Project announced
  Feb 17: Mid-project report due
    Describe progress and planned extensions
    Describe detailed algorithms for all three tasks
  Feb 17: Sample data file will be provided for generating performance results for project report
  March 2: Submit code, README file to run code, code documentation, and final project report
  March 2-4: Project demos (random assignment)
  March 6: Spring break. Second project announced

Finalized Grading Criteria for Class
  Homeworks: 15 points
  Programming projects: 40 points
  Midterm: 20 points
    Note: Midterm is on Feb 19 (Thu) in class
  Final: 25 points

ROLAP Server
  Relational OLAP Server
  sale
    prodId  date  sum
    p1      1     62
    p2      1     19
    p1      2     48
  Figure layers: tools, utilities; ROLAP server; relational DBMS
  Special indices, tuning; schema is “denormalized”

MOLAP Server
  Multi-Dimensional OLAP Server
  Figure: Sales cube with axes Product (milk, soda, eggs, soap), Date (1, 2, 3, 4), and a third axis labeled A, B
  Figure layers: M.D. tools, utilities; multidimensional server
  Could also sit on relational DBMS

MOLAP
  Figure: data cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Country (U.S.A, Canada, Mexico, sum), and items TV, PC, VCR, sum
  Highlighted cell: total annual sales of TV in U.S.A.

MOLAP
  Figure: 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), split into 64 chunks numbered 1 to 64; A varies fastest in the numbering, then B, then C

Challenges in MOLAP
  Storing large arrays for efficient access
    Row-major, column-major
    Chunking
    Compressing sparse arrays
  Creating array data from data in tables
  Efficient techniques for cube computation
  Topics are discussed in the paper for reading

ROLAP Vs. MOLAP
  What do the authors say?
  What can you do in MOLAP that you cannot do in ROLAP?
  Can the algorithm in this paper be used in ROLAP?

Array Storage
  Chunks
  Compression
    Chunk-offset compression Vs. LZW

Loading Arrays from Tables
  The easy case: array fits in memory
  Else: Partitions

Loading Arrays from Tables
  Figure: a table split into partitions
  Suppose there are 1000 chunks. 10 chunks can fit in memory. The partition size is 10 chunks, so there are 100 partitions.
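The partitioned load described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's loader: the helper names (`chunk_id`, `partition_tuples`, `build_chunks`) are hypothetical, partitions are kept as in-memory lists where a real loader would write them to disk, and the array layout (20 cells per dimension, chunk extent 2, hence 10 x 10 x 10 = 1000 chunks) is assumed so that the numbers match the slide.

```python
from collections import defaultdict


def chunk_id(coords, chunk_shape, chunks_per_dim):
    """Map a cell's coordinates to the linear id of the chunk containing it."""
    cid, stride = 0, 1
    for c, size, n in zip(coords, chunk_shape, chunks_per_dim):
        cid += (c // size) * stride
        stride *= n
    return cid


def partition_tuples(tuples, chunk_shape, chunks_per_dim, chunks_in_memory):
    """Pass 1: route each (coords, measure) tuple to the partition owning its chunk.

    Each partition covers `chunks_in_memory` consecutive chunk ids, so all
    chunks of one partition can later be built in memory together.
    """
    partitions = defaultdict(list)
    for coords, measure in tuples:
        cid = chunk_id(coords, chunk_shape, chunks_per_dim)
        partitions[cid // chunks_in_memory].append((cid, coords, measure))
    return partitions


def build_chunks(partition):
    """Pass 2: build the chunks of one partition, summing duplicate cells."""
    chunks = defaultdict(dict)
    for cid, coords, measure in partition:
        chunks[cid][coords] = chunks[cid].get(coords, 0) + measure
    return chunks


# Assumed layout: 1000 chunks in total, 10 chunks fit in memory.
tuples = [((0, 0, 0), 5), ((1, 1, 1), 7), ((19, 19, 19), 3), ((2, 0, 0), 4)]
parts = partition_tuples(tuples, (2, 2, 2), (10, 10, 10), 10)
print(sorted(parts))  # [0, 99]
print(dict(build_chunks(parts[0])))
```

With 1000 chunks and 10 chunks per partition, chunk ids 0 to 999 map to partition ids 0 to 99, which matches the 100 partitions in the example.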

Basic Array Cubing Algo
  First find minimum spanning tree
  Hierarchy of aggregates
  Compute each (k-1)-dimensional aggregate from its best k-dimensional aggregate
  One pass through the array in the right order
  Let us look at some basics first

Chunked 3D Array
  Figure: 3-D array split into 64 chunks numbered 1 to 64, with C along the row of chunks 1 to 4, B along the columns (b0 to b3), and A in depth (a0 to a3)
  Dimension order CBA

“a0b0” chunk
  Chunk order: a0b0c0 c1 c2 c3, a0b1c0 c1 c2 c3, a0b2c0 c1 c2 c3, a0b3c0 c1 c2 c3, …
  Plane AB: a0b0 = xxxx
  Plane AC:
          c0    c1    c2    c3
    a0    x     x     x     x
  Plane BC:
          c0    c1    c2    c3
    b0    x     x     x     x
    b1
    b2
    b3

a0b1 chunk
  Chunk order: a0b0c0 c1 c2 c3, a0b1c0 c1 c2 c3, a0b2c0 c1 c2 c3, a0b3c0 c1 c2 c3, …
  Plane AB: a0b1 = yyyy
  Plane AC:
          c0    c1    c2    c3
    a0    xy    xy    xy    xy
  Plane BC:
          c0    c1    c2    c3
    b0    x     x     x     x
    b1    y     y     y     y
    b2
    b3
  Done with a0b0

a0b2 chunk
  Chunk order: a0b0c0 c1 c2 c3, a0b1c0 c1 c2 c3, a0b2c0 c1 c2 c3, a0b3c0 c1 c2 c3, …
  Plane AB: a0b2 = zzzz
  Plane AC:
          c0    c1    c2    c3
    a0    xyz   xyz   xyz   xyz
  Plane BC:
          c0    c1    c2    c3
    b0    x     x     x     x
    b1    y     y     y     y
    b2    z     z     z     z
    b3
  Done with a0b1

Table Visualization (a0b3 chunk)
  Chunk order: a0b0c0 c1 c2 c3, a0b1c0 c1 c2 c3, a0b2c0 c1 c2 c3, a0b3c0 c1 c2 c3
  Plane AB: a0b3 = uuuu
  Plane AC:
          c0    c1    c2    c3
    a0    xyzu  xyzu  xyzu  xyzu
  Plane BC:
          c0    c1    c2    c3
    b0    x     x     x     x
    b1    y     y     y     y
    b2    z     z     z     z
    b3    u     u     u     u
  Done with a0b2

Table Visualization (a1b0 chunk)
  Chunk order: … a1b0c0 c1 c2 c3, a1b1c0 c1 c2 c3, a1b2c0 c1 c2 c3, a1b3c0 c1 c2 c3, …
  Plane AB: a1b0 = xxxx
  Plane AC:
          c0    c1    c2    c3
    a1    x     x     x     x
  Plane BC:
          c0    c1    c2    c3
    b0    xx    xx    xx    xx
    b1    y     y     y     y
    b2    z     z     z     z
    b3    u     u     u     u
  Done with a0b3
  Done with a0c*

a3b3 chunk (last one)
  Chunk order: … a3b0c0 c1 c2 c3, a3b1c0 c1 c2 c3, a3b2c0 c1 c2 c3, a3b3c0 c1 c2 c3
  Plane AB: a3b3 = uuuu
  Plane AC:
          c0    c1    c2    c3
    a3    xyzu  xyzu  xyzu  xyzu
  Plane BC:
          c0    c1    c2    c3
    b0    xxxx  xxxx  xxxx  xxxx
    b1    yyyy  yyyy  yyyy  yyyy
    b2    zzzz  zzzz  zzzz  zzzz
    b3    uuuu  uuuu  uuuu  uuuu
  Done with a0b3
  Done with a0c*
  Done with b*c*
  Finish

Memory Used
  A: 40 distinct values
  B: 400 distinct values
  C: 4000 distinct values
  CBA: Dimension Order
  Plane AB: Need 1 chunk (10 * 100 * 1)
  Plane AC: Need 4 chunks (10 * 1000 * 4)
  Plane BC: Need 16 chunks (100 * 1000 * 16)
  Total memory: 1,641,000

Memory Used
  A: 40 distinct values
  B: 400 distinct values
  C: 4000 distinct values
  ABC: Dimension Order
  Plane BC: Need 1 chunk (1000 * 100 * 1)
  Plane AC: Need 4 chunks (1000 * 10 * 4)
  Plane AB: Need 16 chunks (100 * 10 * 16)
  Total memory: 156,000

Basic Array Cubing Algo
  First find minimum spanning tree
  Hierarchy of aggregates
  Compute each (k-1)-dimensional aggregate from its best k-dimensional aggregate
  One pass through the array in the right order
  What are the advantages and disadvantages of this algorithm?

Multi-way Array Cubing Algo
  What is the main idea?
  Rule 1 on Page 163
  Minimum memory spanning tree
  Figure 2
  Figures 3 and 4
  Theorem 1
  Basic idea of multi-pass algorithm
  Tradeoff between memory usage and number of passes

D1    D2    D3    M
A3    B2    C1    20
A7    B2    C1    10
A13   B1    C12   30
A2    B2    C1    10
A3    B7    C12   40
A15   B7    C1    20
A6    B1    C12   10
A13   B2    C1    20
A1    B11   C1    100
A1    B11   C1    50
A13   B2    C1    30
A3    B11   C12   10
A13   B7    C1    40
A10   B1    C1    50
A3    B1    C12   10
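The totals on the two Memory Used slides can be checked with a short sketch. It assumes a prefix rule that is consistent with the chunk walk-through: the dimension order is read fastest-varying first (under CBA the chunks c0 to c3 are read first), a plane keeps the full size of every dimension that varies faster than the one it drops, and only one chunk's extent of the others. The function name `plane_memory` and the dictionary layout are hypothetical. The chunk extents 10, 100, and 1000 are taken from the products on the slides.

```python
from math import prod


def plane_memory(order, dim_size, chunk_size):
    """Cells needed per 2-D plane to aggregate a 3-D array in one pass.

    `order` lists the dimensions fastest-varying first (an assumption that
    matches the walk-through, where c0..c3 are read first under "CBA").
    """
    memory = {}
    for dropped in order:
        kept = [d for d in order if d != dropped]
        # Dimensions read before the dropped one must be held at full size;
        # the remaining ones only need one chunk's extent.
        p = order.index(dropped)
        full = prod(dim_size[d] for d in kept[:p])
        partial = prod(chunk_size[d] for d in kept[p:])
        memory["".join(sorted(kept))] = full * partial
    return memory


dim_size = {"A": 40, "B": 400, "C": 4000}
chunk_size = {"A": 10, "B": 100, "C": 1000}  # 4 chunks per dimension

for order in ("CBA", "ABC"):
    m = plane_memory(order, dim_size, chunk_size)
    print(order, m, "total:", sum(m.values()))
# CBA total: 1641000
# ABC total: 156000
```

The gap between the two totals comes almost entirely from plane BC: under CBA it must be held in full (400 * 4000 cells), while under ABC it needs only one chunk (100 * 1000 cells).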