CUBE: A Relational Aggregate Operator Generalizing Group By

Jim Gray, Adam Bosworth, Andrew Layman
Microsoft
Gray@Microsoft.com

Hamid Pirahesh
IBM

[Figure: Aggregate; Group By (with total); Cross Tab; The Data Cube and The Sub-Space Aggregates]


The Data Analysis Cycle

User extracts data from database with query
Then visualizes, analyzes data with desktop tools

[Figure: cycle linking a database Table and a Spread Sheet through extract, analyze, and visualize steps; example charts "Size vs Speed" (Size(B) against Access Time (seconds)) and "Price vs Speed" ($/MB against Access Time (seconds)) for Cache, Main, Secondary Disc, Online Tape, Nearline Tape, and Offline Tape]


Division of labor: Computation vs Visualization

Relational system builds CUBE relation
    – aggregation best done close to data
    – Much filtering of data possible
    – Cube computation may be recursive
        » (e.g., percent of total, quartile, ...)
Visualization System displays/explores the cube

[Figure: 3-D chart of the cube by year (1990, 1991, 1992) and color (Blue, Red, ALL), with values banded 0-50, 50-100, 100-150, 150-200]


Relational Aggregate Operators

SQL has several aggregate operators:
    – sum(), min(), max(), count(), avg()
Other systems extend this with many others:
    – stat functions, financial functions, ...
The basic idea is:
    – Combine all values in a column into a single scalar value.
Syntax:
    select sum(units)
    from inventory;

[Figure: SUM()]


Relational Group By Operator

Group By allows aggregates over table sub-groups
Result is a new table
Syntax:
    select location, sum(units)
    from inventory
    where nation = 'USA'
    group by location;

[Figure: a Table whose attribute column holds the values A, B, C, D is grouped, and SUM() yields one row per value]


Problems With This Design

Users want histograms
Users want sub-totals and totals
    – drill-down & roll-up reports
Users want CrossTabs
Conventional wisdom
    – These are not relational operators
    – They are in many report writers and query engines

[Figure: histogram built from functions F(), G(), H() and a sum; cross tab of AIR, HOTEL, FOOD, MISC by day of week (M T W T F S S)]


Thesis: The Data CUBE Relational Operator Generalizes Group By and Aggregates

[Figure: Aggregate (Sum); Group By with total (Sum by Color: RED, WHITE, BLUE); Cross Tab (Chevy, Ford by Color, with Sum); The Data Cube and The Sub-Space Aggregates (By Make: CHEVY, FORD; By Year: 1990, 1991, 1992, 1993; By Color; By Make & Year; By Color & Year; By Make & Color; Sum)]


The Idea: Think of the N-dimensional Cube

Each Attribute is a Dimension
N-dimensional Aggregate (sum(), max(), ...)
    – fits relational model exactly:
        » a1, a2, ...., aN, f()
Super-aggregate over N-1 Dimensional sub-cubes
        » ALL, a2, ...., aN, f()
        » a1, ALL, a3, ...., aN, f()
        » ...
        » a1, a2, ...., ALL, f()
    – this is the N-1 Dimensional cross-tab.
Super-aggregate over N-2 Dimensional sub-cubes
        » ALL, ALL, a3, ...., aN, f()
        » ...
        » a1, a2, ...., ALL, ALL, f()


An Example

SALES

Model   Year   Color   Sales
Chevy   1990   red         5
Chevy   1990   white      87
Chevy   1990   blue       62
Chevy   1991   red        54
Chevy   1991   white      95
Chevy   1991   blue       49
Chevy   1992   red        31
Chevy   1992   white      54
Chevy   1992   blue       71
Ford    1990   red        64
Ford    1990   white      62
Ford    1990   blue       63
Ford    1991   red        52
Ford    1991   white       9
Ford    1991   blue       55
Ford    1992   red        27
Ford    1992   white      62
Ford    1992   blue       39

DATA CUBE (the rows that CUBE adds to SALES)

Model   Year   Color   Sales
ALL     ALL    ALL       942
chevy   ALL    ALL       510
ford    ALL    ALL       432
ALL     1990   ALL       343
ALL     1991   ALL       314
ALL     1992   ALL       285
ALL     ALL    red       165
ALL     ALL    white     273
ALL     ALL    blue      339
chevy   1990   ALL       154
chevy   1991   ALL       199
chevy   1992   ALL       157
ford    1990   ALL       189
ford    1991   ALL       116
ford    1992   ALL       128
chevy   ALL    red        91
chevy   ALL    white     236
chevy   ALL    blue      183
ford    ALL    red       144
ford    ALL    white     133
ford    ALL    blue      156
ALL     1990   red        69
ALL     1990   white     149
ALL     1990   blue      125
ALL     1991   red       107
ALL     1991   white     104
ALL     1991   blue      104
ALL     1992   red        59
ALL     1992   white     116
ALL     1992   blue      110


Why the ALL Value?

Need a new “Null” value (overloads the null indicator)
Value must not already be in the aggregated domain
Can’t use NULL, since we may aggregate on it.
Think of ALL as a token representing the set
    – {red, white, blue}, {1990, 1991, 1992}, {Chevy, Ford}
Rules for “ALL” in other areas not explored
    – assertions
    – insertion / deletion / ...
    – referential integrity
Follow “set of values” semantics.


CUBE operator: Syntax

Proposed syntax:
    select model, make, year, sum(units)
    from car_sales
    where model in {"chevy", "ford"}
      and year between 1990 and 1994
    group by model, make, year
    with cube
    having sum(units) > 0;
Note: Group By operator repeats aggregate list
    – in select list
    – in group by list


Decorations and Abstractions

Sometimes want to tag cube with redundant values
    – region #, region_name, sales
    – region name is not a dimension, it is a decoration
    – Decorations are functionally dependent on dimensions
More interesting, some “dimensions” are aggregations.
[Figure: hierarchy of block, city, county, state, nation]
Often these aggregations are not linear (are a lattice)
[Figure: time lattice over second, minute, hour, day, week, month, quarter, year, and Holiday (Xmas, Easter, Thanksgiving)]
Rather than treat time as 12 dimensions
    – Recognize abstractions as one dimension (like decorations)
    – Compute efficiently (virtual functions)


Interesting Aggregate Functions

From RedBrick systems
    – Rank (in sorted order)
    – N-Tile (histograms)
    – Running average (cumulative functions)
    – Windowed running average
    – Percent of total
Users want to define their own aggregate functions
    – statistics
    – domain specific


User Defined Aggregates

[Figure: start, next, and end calls operating on a Scratchpad]

Idea:
    – User function is called at start of each group
    – Each function instance has scratchpad
    – Function is called at end of group
Example: SUM
    – START: allocates a cell and sets it to zero
    – NEXT: adds next value to cell
    – END: deallocates cell and returns value
    – Simple example: MAX()
This idea is in Illustra, IBM’s DB2/CS, and SQL standard
Needs extension for rollup and cube


User Defined Aggregate Function Generalized For Cubes

Aggregates have graduated difficulty
    – Distributive: can compute cube from next lower dimension values (count, min, max, ...)
    – Algebraic: can compute cube from next lower scratchpads (average, ...)
    – Holistic: need base data (Median, Mode, Rank, ...)
Distributive and Algebraic have a simple and efficient algorithm: build higher dimensions from core
Holistic computation seems to require multiple passes.
    – real systems use sampling to estimate them
        » (e.g., sample to find median, quartile boundaries)


How To Compute the Cube?

If each attribute has Ni values, CUBE has Π (Ni + 1) values
Compute N-D cube with hash if it fits in RAM
Compute N-D cube with sort if it overflows RAM
Same comments apply to sub-cubes:
    – compute (N-1)-D sub-cube from N-D cube.
    – Aggregate on “biggest” domain first when >1 deep
    – Aggregate functions need hidden variables:
        » e.g. average needs sum and count.
Use standard techniques from query processing
    – arrays, hashing, hybrid hashing
    – fall back on sorting.


Example: Compute 2D core of 2 x 3 cube
Then compute 1D edges
Then compute 0D point
Works for algebraic and distributive functions
Saves “lots” of calls


Summary

CUBE operator generalizes relational aggregates
Needs ALL value to denote sub-cubes
    – ALL values represent aggregation sets
Needs generalization of user-defined aggregates
Decorations and abstractions are interesting
Computation has interesting optimizations

Research Topics
Generalize Spreadsheet Pivot operator to RDBs
Characterize Algebraic/Distributive/Holistic functions for update