Cube Computation and Indexes for Data Warehouses
CPS 196.03
Notes 7

Processing
• ROLAP servers vs. MOLAP servers
• Index Structures
• Cube computation
• What to Materialize?
• Algorithms
[Figure: warehouse architecture with Source, Integration, Warehouse, Metadata, Query & Analysis, and Client components]

ROLAP Server
• Relational OLAP Server
• Special indices, tuning; schema is “denormalized”
[Figure: tools and utilities on top of a ROLAP server, which sits on a relational DBMS]
Example table sale:
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48

MOLAP Server
• Multi-Dimensional OLAP Server
• Could also sit on relational DBMS
[Figure: M.D. tools and utilities on top of a multidimensional server; Sales cube with axes Product (milk, soda, eggs, soap), Date (1, 2, 3, 4), and a third axis with values A, B]

MOLAP
[Figure: sales cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum); highlighted cell: total annual sales of TV in U.S.A.]

MOLAP
[Figure: 3-D array over dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), with cells numbered 1 to 64]

Challenges in MOLAP
• Storing large arrays for efficient access
  • Row-major, column-major
  • Chunking
• Compressing sparse arrays
• Creating array data from data in tables
• Efficient techniques for cube computation
Topics are discussed in the paper for reading

Index Structures
• Traditional Access Methods
  • B-trees, hash tables, R-trees, grids, …
• Popular in Warehouses
  • inverted lists
  • bit map indexes
  • join indexes
  • text indexes

Inverted Lists
age index: 18 19 20 21 22 23 25 26
inverted lists:
  20: r4, r18, r34, r35
  21: r5, r19, r37, r40
data records:
rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26
...

Using Inverted Lists
• Query: Get people with age = 20 and name = “fred”
• List for age = 20: r4, r18, r34, r35
• List for name = “fred”: r18, r52
• Answer is intersection: r18

Bit Maps
age index: 18 19 20 21 22 23 25 26
bit maps:
  20: 1101100000
  21: 0010001011
data records:
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26
...

Bitmap Index
• Index on a particular column
• Each value in the column has a bit vector: bit operations are fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• Not suitable for high-cardinality domains

Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1

Using Bit Maps
• Query: Get people with age = 20 and name = “fred”
• List for age = 20: 1101100000
• List for name = “fred”: 0100000001
• Answer is intersection (bitwise AND): 0100000000
• Good if domain cardinality is small
• Bit vectors can be compressed

Join
• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE ...

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4

Join Indexes
product (with join index)
id  name  price  jIndex
p1  bolt  10     r1, r3, r5, r6
p2  nut   5      r2, r4

sale
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4

Cube Computation for Data Warehouses

Counting Exercise
• How many cuboids are there in a cube?
  • The full or nothing case
  • When dimension hierarchies are present
• What is the size of each cuboid?
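
The counting exercise can be sketched in a few lines of Python. This is only an illustration under the usual counting argument: without hierarchies each dimension is either kept or aggregated away, and with hierarchies a dimension that has L_i levels (not counting "all") offers L_i + 1 choices. The function names are hypothetical and do not come from the notes.

```python
from itertools import combinations
from math import prod


def count_cuboids_flat(n_dims):
    """Cuboids when no dimension has a hierarchy: one per subset of dimensions."""
    return 2 ** n_dims


def count_cuboids_with_hierarchies(levels_per_dim):
    """Cuboids when dimension i has levels_per_dim[i] levels (excluding 'all').

    Each dimension is either rolled up to 'all' or grouped at one of its
    levels, so it contributes (levels + 1) independent choices.
    """
    return prod(levels + 1 for levels in levels_per_dim)


def enumerate_cuboids(dims):
    """List every group-by set for dimensions without hierarchies."""
    return [combo
            for k in range(len(dims) + 1)
            for combo in combinations(dims, k)]


if __name__ == "__main__":
    print(count_cuboids_flat(3))
    print(enumerate_cuboids(["city", "product", "date"]))
    # city has two levels (state, city); product and date have one each
    print(count_cuboids_with_hierarchies([2, 1, 1]))
```

For three dimensions city, product, and date with no hierarchies this gives 2^3 = 8 cuboids; if city also rolls up to state, the count becomes 3 × 2 × 2 = 12. The size of an individual cuboid is bounded above by the product of the distinct-value counts of its group-by attributes, though sparse data usually yields far fewer cells.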

Lattice of Cuboids
[Figure: lattice with nodes all; city, product, date; (city, product), (city, date), (product, date); (city, product, date)]
Example cuboid contents (blank cells shown as -):
all: 129
city: c1 67, c2 12, c3 50
city, product:
      c1  c2  c3
  p1  56   4  50
  p2  11   8   -
city, product, date:
  day 1:
      c1  c2  c3
  p1  12   -  50
  p2  11   8   -
  day 2:
      c1  c2  c3
  p1  44   4   -
  p2   -   -   -

Dimension Hierarchies
hierarchy: all > state > city
cities:
city  state
c1    CA
c2    NY

Dimension Hierarchies
[Figure: lattice with nodes all; city; product; date; state; (city, product); (city, date); (product, date); (state, product); (state, date); (city, product, date); (state, product, date); not all arcs shown...]

Efficient Data Cube Computation
• Data cube can be viewed as a lattice of cuboids
  • The bottom-most cuboid is the base cuboid
  • The top-most cuboid (apex) contains only one cell
  • How many cuboids in an n-dimensional cube with L levels?
    $T = \prod_{i=1}^{n} (L_i + 1)$, where $L_i$ is the number of levels of dimension $i$
• Materialization of data cube
  • Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
  • Selection of which cuboids to materialize: based on size, sharing, access frequency, etc.

Derived Data
• Derived Warehouse Data
  • indexes
  • aggregates
  • materialized views (next slide)
• When to update derived data?
• Incremental vs. refresh

Idea of Materialized Views
• Define new warehouse tables/arrays
• sale(prodId, storeId, date, amt) and product(id, name, price) are source tables
• joinTb(prodId, name, price, storeId, date, amt), their join, does not exist at any source
(same rows as in the Join slide)

Efficient OLAP Processing
• Determine which operations should be performed on available cuboids
  • Transform drill, roll, etc.
    into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
• Determine which materialized cuboid(s) should be selected for OLAP:
  • Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and suppose 4 materialized cuboids are available:
    1) {year, item_name, city}
    2) {year, brand, country}
    3) {year, brand, province_or_state}
    4) {item_name, province_or_state} where year = 2004
    Which should be selected to process the query?
• Explore indexing structures & compressed vs. dense arrays in MOLAP

What to Materialize?
• Store in warehouse results useful for common queries
• Example: total sales (blank cells shown as -)
base data (city, product, date):
  day 1:
      c1  c2  c3
  p1  12   -  50
  p2  11   8   -
  day 2:
      c1  c2  c3
  p1  44   4   -
  p2   -   -   -
materialize:
  city, product:
      c1  c2  c3
  p1  56   4  50
  p2  11   8   -
  city: c1 67, c2 12, c3 50
  product: p1 110, p2 19
  all: 129

Materialization Factors
• Type/frequency of queries
• Query response time
• Storage cost
• Update cost
Will study a concrete algorithm later

Iceberg Cube
• Compute only the cuboid cells whose count or other aggregate satisfies a condition such as
  HAVING COUNT(*) >= minsup
• Motivation
  • Only a small portion of cube cells may be “above the water” in a sparse cube
  • Only calculate “interesting” cells: data above a certain threshold
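
To make the iceberg condition HAVING COUNT(*) >= minsup concrete, here is a minimal sketch in Python. It is a naive illustration that scans the data once per group-by, not one of the efficient cube-computation algorithms from the reading; the function name `iceberg_cube` and the choice `minsup=2` are assumptions made for the example, and the rows are the sale tuples used in the earlier slides.

```python
from collections import Counter
from itertools import combinations

# Dimension columns by position; amt (the last field) is not used for COUNT(*).
DIMS = ("prodId", "storeId", "date")
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def iceberg_cube(rows, n_dims, minsup):
    """Return {group-by positions: {cell: count}} for cells with count >= minsup.

    Naive approach: one pass over the rows for each of the 2^n group-bys.
    Cuboids with no qualifying cell are omitted from the result.
    """
    result = {}
    for k in range(n_dims + 1):
        for group_by in combinations(range(n_dims), k):
            counts = Counter(tuple(row[i] for i in group_by) for row in rows)
            kept = {cell: c for cell, c in counts.items() if c >= minsup}
            if kept:
                result[group_by] = kept
    return result


cube = iceberg_cube(SALE, len(DIMS), minsup=2)

if __name__ == "__main__":
    for group_by, cells in cube.items():
        names = tuple(DIMS[i] for i in group_by) or ("all",)
        print(names, cells)
```

With these six rows, every cell of the base cuboid (prodId, storeId, date) has count 1, so the whole base cuboid falls "below the water" and is dropped, while coarser cuboids such as (prodId, storeId) retain only the cell (p1, c1).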