OLAP & Data Cubing
Spring 2007
Nick Roussopoulos
nick@cs.umd.edu
N. Roussopoulos 2007, Database Management Systems

OLAP: The Data Analysis Cycle
• User extracts data from the database with a query
• Then visualizes and analyzes the data with desktop tools
[Figure: data flows from a table into a spreadsheet. Storage hierarchy plots "Size vs Speed" (Size (B) vs Access Time (seconds)) and "Price vs Speed" ($/MB vs Access Time (seconds)) for cache, main memory, online secondary disc, and online, nearline, and offline tape.]

The Data Cube [Gray, Bosworth, Layman, Pirahesh ICDE 96]
• summarizes multidimensional data for trend analysis
    weather(time, latitude, longitude, altitude, temp, b-pressure)
• group-by with statistical functions (avg, min, max, count, sum) aggregates over table sub-groups
    select avg(temp) from weather
    select time, altitude from weather group by time, altitude
• results in a new table
    select   location, sum(units)
    from     inventory
    group by location
    having   nation = "USA";
[Figure: a table whose rows carry attribute values A, B, C, D is reduced by SUM() to one row per value.]

An Example

SALES
Model   Year   Color   Sales
Chevy   1990   red         5
Chevy   1990   white      87
Chevy   1990   blue       62
Chevy   1991   red        54
Chevy   1991   white      95
Chevy   1991   blue       49
Chevy   1992   red        31
Chevy   1992   white      54
Chevy   1992   blue       71
Ford    1990   red        64
Ford    1990   white      62
Ford    1990   blue       63
Ford    1991   red        52
Ford    1991   white       9
Ford    1991   blue       55
Ford    1992   red        27
Ford    1992   white      62
Ford    1992   blue       39

DATA CUBE (result of applying CUBE to SALES)
Model   Year   Color   Sales
ALL     ALL    ALL       942
chevy   ALL    ALL       510
ford    ALL    ALL       432
ALL     1990   ALL       343
ALL     1991   ALL       314
ALL     1992   ALL       285
ALL     ALL    red       165
ALL     ALL    white     273
ALL     ALL    blue      339
chevy   1990   ALL       154
chevy   1991   ALL       199
chevy   1992   ALL       157
ford    1990   ALL       189
ford    1991   ALL       116
ford    1992   ALL       128
chevy   ALL    red        91
chevy   ALL    white     236
chevy   ALL    blue      183
ford    ALL    red       144
ford    ALL    white     133
ford    ALL    blue      156
ALL     1990   red        69
ALL     1990   white     149
ALL     1990   blue      125
ALL     1991   red       107
ALL     1991   white     104
ALL     1991   blue      104
ALL     1992   red        59
ALL     1992   white     116
ALL     1992   blue      110

Division of Labor: Computation vs Visualization
• Relational system builds the CUBE relation
    - aggregation is best done close to the data
    - filtering of data is possible
    - cube computation may be recursive (e.g., percent of total, quartile, ...)
• Visualization system displays/explores the cube
[Figure: 3-D chart of sales with axis labels 1990, 1991, 1992, ALL and Red, Blue; value bands 0-50, 50-100, 100-150, 150-200.]

Problems with SQL Group-bys
• Histograms (aggregation over computed categories)
[Figure: GROUP BY and CUBE applied to computed categories F(), G(), H().]
• drill-down and roll-up
    - not relational (null values in the keys)

More Problems with Group-bys
• roll-up is asymmetric (e.g., it does not aggregate by year or by color alone)
• cross-tabulation (spreadsheets)
• even if SQL syntax can be devised, a 6D cross-tab requires 64 group-by queries to generate it, and 64 scans and sorts of the data
    - most of these are not relational expressions, but they appear in many report writers
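The cost argument above (one group-by per subset of dimensions) can be sketched in a few lines. This is only an illustration, not the operator from the Gray et al. paper: it assumes an in-memory list holding just the six 1990 rows of the SALES table, and the names `group_by` and `naive_cube` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"

# The 1990 rows of the SALES table: (model, year, color, sales).
SALES_1990 = [
    ("Chevy", 1990, "red", 5),
    ("Chevy", 1990, "white", 87),
    ("Chevy", 1990, "blue", 62),
    ("Ford", 1990, "red", 64),
    ("Ford", 1990, "white", 62),
    ("Ford", 1990, "blue", 63),
]


def group_by(rows, n_dims, keep):
    """One scan of the data: SUM the measure, writing ALL for dropped dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] if i in keep else ALL for i in range(n_dims))
        totals[key] += row[n_dims]
    return dict(totals)


def naive_cube(rows, n_dims):
    """Run one independent group-by per subset of dimensions (2**n_dims scans)."""
    cube, scans = {}, 0
    for k in range(n_dims + 1):
        for keep in combinations(range(n_dims), k):
            cube.update(group_by(rows, n_dims, set(keep)))
            scans += 1
    return cube, scans


cube, scans = naive_cube(SALES_1990, 3)
print(scans)                    # 8 group-bys for 3 dimensions
print(cube[(ALL, 1990, ALL)])   # 343, the "ALL 1990 ALL" row of the data cube
print(cube[(ALL, 1990, "red")])  # 69
```

With six dimensions the same loop would issue 2**6 = 64 separate group-bys, each with its own scan, which is the overhead the CUBE operator is meant to avoid.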
CUBE: A Relational Aggregate Operator Generalizing Group By
[Figure: four panels. "Aggregate" (Sum); "Group By (with total)" (RED, WHITE, BLUE by color, plus Sum); "Cross Tab" (Chevy and Ford against RED, WHITE, BLUE, with sums by color and by make); "The Data Cube and The Sub-Space Aggregates" (by make, by year, by color, by make & year, by make & color, by color & year, Sum).]

Idea: N-dimensional Cube. Each Attribute is a Dimension
• N-dimensional aggregate (sum(), max(), ...) fits the relational model exactly:
    a1, a2, ..., aN, f(*)
• Super-aggregate over N-1 dimensional sub-cubes:
    ALL, a2, ..., aN, f(*)
    a1, ALL, a3, ..., aN, f(*)
    ...
    a1, a2, ..., ALL, f(*)
  this is the N-1 dimensional cross-tab.
• Super-aggregate over N-2 dimensional sub-cubes:
    ALL, ALL, a3, ..., aN, f(*)
    ...
    a1, a2, ..., ALL, ALL, f(*)

Summary of the Cube
• CUBE operator generalizes relational aggregates
• Needs an ALL value to denote sub-cubes
    - ALL values represent aggregation sets
• Needs generalization of user-defined aggregates
• Decorations and abstractions are interesting
• Computation has interesting optimizations
• Relationship to "rest of SQL" not fully worked out.

Computing the (full) Cube
• Discussion from: "Computation of Multi-dimensional Aggregates"; SIGMOD 1996
• Options:
    - One SQL query for each group-by on the original data
    - Use one group-by to compute another
        e.g., the group-by on (A, B) can be computed from the group-bys on (A, B, C) or (A, B, D), etc.
    - Compute multiple group-bys simultaneously
        Overlap Method (main contribution of the above paper)
• For each group-by, we can use "sorting" or "hashing". Say (A, B):
    - Sorting: sort the relation first by A, then by B; make one scan and compute the aggregates one by one
    - Hashing: maintain the aggregates (|A| * |B| of them) in memory; scan the relation once and update them as appropriate
• What if we want to compute (A, B) from (A, B, C)? More specifically:
    - If the group-by on (A, B, C) is already sorted, we can do it very efficiently
    - Records (a1, b1, c1, aggr), (a1, b1, c2, aggr), ... are sequential, so we need just one tuple of memory
• What if we want to compute (A, B) from (A, C, B)?
    - This is trickier, since the tuples needed to compute (a1, b1, aggr) are not contiguous
    - Need memory equal to |B| tuples

Computing the (full) Cube
• An example of memory requirements and actual computation
[Figures not recovered.]

Cube = {Materialized Views} [Harinarayan, Rajaraman, Ullman 96]
• each group-by creates a "summary table", which is a materialized view with some dressing
• storing these summary tables speeds up cube queries
• what to store and what not to store
• TPC-D example for sales analysis

The Lattice Organization
• the query "sales group by part" can be answered at any of:
    p:    cost of scanning 0.2M records
    pc:   cost of scanning 6.0M records
    psc:  cost of scanning 6.0M records
• select the views that minimize the overall query cost
    - need a good query model
    - need a good optimization criterion

Views Grow Exponentially
• in general 2**N subspaces
    ABCD
    ABC  ABD  ACD  BCD
    AB   AC   AD   BC   BD   CD
    A    B    C    D
    none
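Each edge of the lattice is a chance to compute a group-by from a parent rather than from the base data. The two cases discussed earlier, (A, B) from a sorted (A, B, C) and (A, B) from a sorted (A, C, B), can be sketched as follows. This is a minimal illustration with SUM as the aggregate; the function names and the sample rows are hypothetical, and it is not the Overlap method itself.

```python
from itertools import groupby
from operator import itemgetter


def ab_from_sorted_abc(abc_rows):
    """abc_rows: (a, b, c, aggr) tuples sorted by (a, b, c).

    All rows for one (a, b) are contiguous, so a single running
    total is the only state held at any time.
    """
    for (a, b), run in groupby(abc_rows, key=itemgetter(0, 1)):
        yield (a, b, sum(r[3] for r in run))


def ab_from_sorted_acb(acb_rows):
    """acb_rows: (a, c, b, aggr) tuples sorted by (a, c, b).

    Rows for one (a, b) are interleaved with other b values, so one
    partial sum per b (up to |B| entries) is kept until a changes.
    """
    for a, run in groupby(acb_rows, key=itemgetter(0)):
        partial = {}
        for _, _, b, aggr in run:
            partial[b] = partial.get(b, 0) + aggr
        for b in sorted(partial):
            yield (a, b, partial[b])


# Hypothetical (A, B, C) group-by result, already sorted by (a, b, c).
abc = [
    ("a1", "b1", "c1", 1),
    ("a1", "b1", "c2", 2),
    ("a1", "b2", "c1", 3),
    ("a2", "b1", "c1", 4),
]
# The same data laid out and sorted as (a, c, b).
acb = sorted((a, c, b, v) for a, b, c, v in abc)

from_abc = list(ab_from_sorted_abc(abc))
from_acb = list(ab_from_sorted_acb(acb))
print(from_abc == from_acb)
```

Both functions produce the same (A, B) aggregates; the difference is only in how much state has to be buffered while scanning.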
Greedy Allocation Algorithm
• optimization criterion:
    - storage S (total capacity)
    - query model (query frequencies to all views)
    - find the best views to materialize
• linear cost model:
    - the cost of answering Q from a materialized view A generated by QA (ancestor) is the size of the table A
    - the cost of accessing part of a view is equal to the cost of accessing the whole view
• for each view v in a subset of views S:
    - C(v) is the storage cost
    - B(v,S) is the benefit of v with respect to S
    - for each w <= v (w is covered by v):
        u is the minimum-cost view in S such that w <= u
        if C(v) < C(u) then B_w = C(u) - C(v) else B_w = 0
    - $B(v,S) = \sum_{w \le v} B_w$
• B(v,S) computes the benefit of v by considering how much it helps the other views that it covers: if the cost of answering through v is lower than through v's competitors, the difference is added to the total benefit of v

Greedy Allocation Algorithm
• Can be shown to have an approximation ratio of (1 - 1/e) = 0.63...
    - Seems a straightforward application of submodularity of the objective, but did not check carefully
• Issues:
    - Dealing with hierarchies for roll-ups and drill-downs
        Can be incorporated without much trouble
    - Cost of using a view is assumed to be linear
        Usually not true
        We may want to build indexes on the views
        Addressed in a later paper by Gupta, Harinarayan, Rajaraman, and Ullman, 1997
    - Did not look at "refresh time"
        Even with batch updates, we still have limited time to do the updates
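The benefit definition B(v,S) and the greedy loop can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: it picks a fixed number k of views instead of filling a storage budget, treats all queries as equally frequent, and represents each view as a frozenset of dimension letters so that `w <= v` is the "covered by" test. The sizes for p, pc and psc are the ones quoted on the lattice slide; the sizes for ps, sc, s, c and none are assumed values chosen only for the example.

```python
def benefit(v, selected, size):
    """B(v, S): sum over every view w covered by v of C(u) - C(v),
    where u is the cheapest already-selected view that covers w."""
    total = 0.0
    for w in size:
        if w <= v:  # frozenset subset test: w can be answered from v
            cheapest = min(size[u] for u in selected if w <= u)
            if size[v] < cheapest:
                total += cheapest - size[v]
    return total


def greedy_select(size, top, k):
    """Start from the top view and add the k views with the largest benefit."""
    selected = [top]
    for _ in range(k):
        candidates = [v for v in size if v not in selected]
        best = max(candidates, key=lambda v: benefit(v, selected, size))
        selected.append(best)
    return selected


# View sizes in millions of rows. p, pc, psc come from the lattice slide;
# the remaining entries are assumptions for illustration only.
ROWS_M = {
    "psc": 6.0, "pc": 6.0, "ps": 0.8, "sc": 6.0,
    "p": 0.2, "s": 0.01, "c": 0.1, "": 0.000001,
}
size = {frozenset(name): rows for name, rows in ROWS_M.items()}

picked = greedy_select(size, frozenset("psc"), 2)
print(["".join(sorted(v)) or "none" for v in picked])
```

With these sizes the first pick after psc is ps (it covers four views and is much smaller than psc), and the second is c, because c is the only remaining small view that ps cannot answer.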
DynaMat
Yannis Kotidis, Nick Roussopoulos (SIGMOD 1999)
• Conventional data warehouse
    - the pre-computed set of views is static (too hard to select and adjust)
    - usually selected by an administrator
• DynaMat proposed a framework for automatic management of views
    - unifies view selection and view refresh
    - amortizes generation and maintenance cost over multiple uses of cached results
• Techniques
    - DynaMat caches the results of every query
    - each incoming query is evaluated against the cached results to see if any of them can be used
    - the captured set is updated within an update cycle to the extent possible

DynaMat Architecture: Online Operation
• Try to match each query from the view pool (Fragment Locator)
    - fragments are either single-value predicates or complete ranges
    - a Directory Index is maintained for efficient searches
• Decide on the fly whether to cache the result in the pool (Admission Control Entity)

Materialized Range Fragments
• In each dimension d_i, a materialized result is restricted to one of:
    a) a full range R_i = {min_d, max_d}
    b) a single value for d_i
    c) an empty range, denoting a dimension that is not present in the query
• SQL queries are mapped to MR queries that are answered by cached MRFs
• MRFs are coarser than query results (expanded when necessary)
• No combination of MRFs is used to answer a query (combining is more costly, especially when MRFs are too small and/or overlap)
• An R-tree based index is used to identify possible MRFs that can answer the query; among those, the best fit is chosen
• The use of MRFs makes matching efficient.

Storage Structures & Construction of Cubes
• Subcubes vs. full cubes
    - subcube selection
• Cost of construction and indexing
• Maintenance
Cubetrees [Roussopoulos, Kotidis, Roussopoulos 97]
• better storage organization needed
    - materialized views and indexes are not different
    - single storage organization for both
• bulk load techniques are very important
    - rates should be on the order of GB/hour (industrial strength)
• incremental bulk updates are the MOST important issue
• we had lots of experience with spatial access methods, mainly with all possible variations of R-trees (handy)
• packed R-trees

Extended Data Cube Model
• relation tuples: points in the N-d space
• group-by projections: also points
• point data is very efficient for multidimensional indexing
[Figure: table R(A,B,C,Q) drawn in a space with axes A, B, C. A relation tuple T(a,b,c,q) projects to (a,b,0,q) for groupby(A,B), (a,0,c,q) for groupby(A,C), (0,b,c,q) for groupby(B,C), (a,0,0,q) for groupby(A), (0,b,0,q) for groupby(B), (0,0,c,q) for groupby(C), and the origin for groupby(none).]

Dataless Cubetree
• separate the fact table (relation) points
• keep only the aggregate projection points in the cubetree to reduce the size
[Figure: the same space for table R(A,B,C,Q), showing only the group-by projection points, without the relation tuples.]

BIGGEST CHALLENGE: In-place Update Problem
• each record in the fact table may update an exponential number of other points (in 3-d, 2^3 = 8 points)
• record-at-a-time updates
    - are too expensive in terms of I/O
    - destroy the clustering of data points
    - kill indexes
• main reason for SIR < 2%
[Figure: inserting T'(a,b,c',q') next to T(a,b,c,q) adds the points (a,0,c',q'), (0,b,c',q'), (0,0,c',q') and changes (a,b,0,q), (a,0,0,q), (0,b,0,q), (0,0,0,q) to (a,b,0,q+q'), (a,0,0,q+q'), (0,b,0,q+q'), (0,0,0,q+q').]
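The 2^N fan-out of a single fact record can be made concrete with a short sketch. It assumes, as in the extended data cube model, that a dropped dimension is stored as coordinate 0 and that real attribute values are never 0; the function name `projection_points` is hypothetical.

```python
from itertools import product


def projection_points(coords):
    """All points touched by one fact tuple: each coordinate is either
    kept or projected to 0, giving 2**len(coords) points in total."""
    choices = [(x, 0) for x in coords]
    return list(product(*choices))


points = projection_points(("a", "b", "c"))
print(len(points))  # 8 points in 3-d, counting the tuple itself
for p in points:
    print(p)
```

Applying one new record in place means visiting every one of these points, which is why the slides argue for bulk incremental updates instead of record-at-a-time updates.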
Future of OLAP & Data Cubing
• The big rush to data warehousing has passed and left a bitter taste
    - too costly
    - did not achieve the promised database integration
• New applications with multi-dimensional data are needed
    - cost with today's technology is much lower
    - data integration is not easier and requires hard brain work
• Promising data areas
    - Scientific
    - Security
    - Web