CUBE: A Relational Aggregate Operator Generalizing Group By
Jim Gray, Adam Bosworth, Andrew Layman (Microsoft)
Hamid Pirahesh (IBM)
Gray@Microsoft.com
[Figure: an aggregate (Sum); a Group By with total (Sum by Color: RED, WHITE, BLUE); a cross tab (Chevy, Ford by Color); and the data cube with its sub-space aggregates (by Make, by Year, by Color, by Make & Year, by Make & Color, by Color & Year).]
The Data Analysis Cycle
User extracts data from the database with a query
Then visualizes and analyzes the data with desktop tools
[Figure: a Table and a Spread Sheet, with two plots of the storage hierarchy (cache, main, secondary disc, online tape, nearline tape, offline tape): "Size vs Speed" (Size (B) against Access Time (seconds)) and "Price vs Speed" ($/MB against Access Time (seconds)).]
Division of labor: Computation vs Visualization
Relational system builds the CUBE relation
– Aggregation is best done close to the data
– Much filtering of data is possible
– Cube computation may be recursive
» (e.g., percent of total, quartile, ...)
Visualization system displays and explores the cube
[Figure: 3-D chart of cube values on a 0 to 200 scale (legend bands 0-50, 50-100, 100-150, 150-200); axis labels 1990, 1991, 1992, Red, Blue, ALL.]
Relational Aggregate Operators
SQL has several aggregate operators:
– sum(), min(), max(), count(), avg()
Other systems extend this with many others:
– stat functions, financial functions, ...
The basic idea is:
– Combine all values in a column into a single scalar value.
Syntax:
select sum(units)
from inventory;
Relational Group By Operator
Group By allows aggregates over table sub-groups
Result is a new table
Syntax:
select location, sum(units)
from inventory
where nation = 'USA'
group by location;
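As a rough illustration of what Group By computes, the sketch below filters rows by nation and sums units per location in plain Python. The rows and the names `inventory_rows` and `group_sum` are made up for this example; they do not come from the slides.

```python
from collections import defaultdict

# Hypothetical inventory rows: (location, nation, units).
inventory_rows = [
    ("Seattle", "USA", 10),
    ("Seattle", "USA", 5),
    ("Boston", "USA", 7),
    ("Toronto", "Canada", 4),
]

def group_sum(rows, nation):
    """Keep rows for one nation, then sum units per location.

    The result has one entry per group, which is why the output of
    Group By is itself a (smaller) table.
    """
    totals = defaultdict(int)
    for location, row_nation, units in rows:
        if row_nation == nation:       # filter on nation
            totals[location] += units  # one running sum per location
    return dict(totals)

print(group_sum(inventory_rows, "USA"))
```

Each distinct location becomes one output row, however many input rows it had.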
[Figure: a table whose grouping attribute takes the values A, B, C, D over many rows; SUM() produces one result row for each of A, B, C, D.]
Problems With This Design
Users want histograms
Users want sub-totals and totals
– drill-down & roll-up reports
Users want CrossTabs
[Figure: a sketch labeled F(), G(), H() with a sum; a cross tab of AIR, HOTEL, FOOD, MISC by day of week (M T W T F S S).]
Conventional wisdom
– These are not relational operators
– They are in many report writers and query engines
Thesis: The Data CUBE Relational Operator Generalizes Group By and Aggregates
[Figure: an aggregate (Sum); a Group By with total (Sum by Color: RED, WHITE, BLUE); a cross tab (Chevy, Ford by Color); and the data cube (FORD, CHEVY; 1990 to 1993; RED, WHITE, BLUE) with its sub-space aggregates (by Make, by Year, by Color, by Make & Year, by Make & Color, by Color & Year).]
The Idea:
Think of the N-dimensional Cube
Each Attribute is a Dimension
N-dimensional Aggregate (sum(), max(), ...)
– fits relational model exactly:
» a1, a2, ..., aN, f()
Super-aggregate over N-1 Dimensional sub-cubes
» ALL, a2, ..., aN, f()
» a1, ALL, a3, ..., aN, f()
» ...
» a1, a2, ..., ALL, f()
– this is the N-1 Dimensional cross-tab.
Super-aggregate over N-2 Dimensional sub-cubes
» ALL, ALL, a3, ..., aN, f()
» ...
» a1, a2, ..., ALL, ALL, f()
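The pattern above can be enumerated mechanically: each attribute is either kept or replaced by ALL, which gives 2^N row shapes for N attributes. The following is a minimal sketch of that enumeration; the function name `cube_patterns` and the use of the string "ALL" as the token are assumptions made for illustration.

```python
from itertools import product

ALL = "ALL"  # stand-in token for this sketch only

def cube_patterns(attributes):
    """Yield one tuple per sub-cube shape.

    Each attribute is either kept (its name appears) or rolled up
    (replaced by ALL), so N attributes give 2**N shapes.
    """
    for mask in product([False, True], repeat=len(attributes)):
        yield tuple(ALL if rolled_up else name
                    for name, rolled_up in zip(attributes, mask))

patterns = list(cube_patterns(["a1", "a2", "a3"]))

# Group the shapes by how many attributes were replaced by ALL:
# level 0 is the core, level 1 the N-1 dimensional sub-cubes, and so on.
by_level = {}
for p in patterns:
    by_level.setdefault(p.count(ALL), []).append(p)

for level in sorted(by_level):
    print(level, by_level[level])
```

For three attributes this gives one core shape, three shapes with a single ALL, three with two, and the grand total (ALL, ALL, ALL).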
An Example
SALES
Model  Year  Color  Sales
Chevy  1990  red        5
Chevy  1990  white     87
Chevy  1990  blue      62
Chevy  1991  red       54
Chevy  1991  white     95
Chevy  1991  blue      49
Chevy  1992  red       31
Chevy  1992  white     54
Chevy  1992  blue      71
Ford   1990  red       64
Ford   1990  white     62
Ford   1990  blue      63
Ford   1991  red       52
Ford   1991  white      9
Ford   1991  blue      55
Ford   1992  red       27
Ford   1992  white     62
Ford   1992  blue      39
Applying CUBE to SALES gives the DATA CUBE:
DATA CUBE
Model  Year  Color  Sales
ALL    ALL   ALL      942
chevy  ALL   ALL      510
ford   ALL   ALL      432
ALL    1990  ALL      343
ALL    1991  ALL      314
ALL    1992  ALL      285
ALL    ALL   red      165
ALL    ALL   white    273
ALL    ALL   blue     339
chevy  1990  ALL      154
chevy  1991  ALL      199
chevy  1992  ALL      157
ford   1990  ALL      189
ford   1991  ALL      116
ford   1992  ALL      128
chevy  ALL   red       91
chevy  ALL   white    236
chevy  ALL   blue     183
ford   ALL   red      144
ford   ALL   white    133
ford   ALL   blue     156
ALL    1990  red       69
ALL    1990  white    149
ALL    1990  blue     125
ALL    1991  red      107
ALL    1991  white    104
ALL    1991  blue     104
ALL    1992  red       59
ALL    1992  white    116
ALL    1992  blue     110
Why the ALL Value?
Need a new “Null” value (overloads the null indicator)
The value must not already be in the aggregated domain
Can’t use NULL, since a column being aggregated on may itself contain NULLs
Think of ALL as a token representing the set
– {red, white, blue}, {1990, 1991, 1992}, {Chevy, Ford}
Rules for “ALL” in other areas not explored
– assertions
– insertion / deletion / ...
– referential integrity
Follow “set of values” semantics.
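To see why the token has to sit outside the domain, consider a grouping column that itself contains NULLs. The sketch below is a hypothetical illustration (the class `AllToken` and the sample rows are invented here): it uses a dedicated sentinel for ALL so the grand-total row cannot collide with the group of NULL-colored rows.

```python
from collections import defaultdict

class AllToken:
    """Sentinel standing for 'the set of all values of this attribute'."""
    def __repr__(self):
        return "ALL"

ALL = AllToken()

# Hypothetical rows: (color, units). One row has an unknown (NULL) color,
# modeled here as None.
rows = [("red", 5), ("white", 7), (None, 3)]

totals = defaultdict(int)
for color, units in rows:
    totals[color] += units  # ordinary group, including the NULL group
    totals[ALL] += units    # super-aggregate over every color

print(dict(totals))
```

Had None been reused for the total, the 3 units of the NULL group and the grand total of 15 would have landed in the same key.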
CUBE operator: Syntax
Proposed syntax:
select model, make, year, sum(units)
from car_sales
where model in ('chevy', 'ford')
and year between 1990 and 1994
group by cube model, make, year
having sum(units) > 0;
Note: Group By operator repeats aggregate list
– in select list
– in group by list
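On an engine without a CUBE operator, the same result can be approximated as a UNION ALL of one ordinary GROUP BY per subset of the grouping columns. The sketch below does this with Python's `sqlite3` module. The helper `cube_sql`, the two-column `car_sales` table, and its rows are assumptions for illustration, and the literal string 'ALL' stands in for the ALL value (a simplification that only works when 'ALL' is not a real value in the column).

```python
import sqlite3
from itertools import product

def cube_sql(table, dims, measure):
    """Build a UNION ALL with one GROUP BY branch per subset of dims."""
    branches = []
    for keep in product([True, False], repeat=len(dims)):
        # Kept dimensions are selected as-is; the others become 'ALL'.
        cols = [d if k else f"'ALL' AS {d}" for d, k in zip(dims, keep)]
        kept = [d for d, k in zip(dims, keep) if k]
        sql = f"SELECT {', '.join(cols)}, SUM({measure}) AS total FROM {table}"
        if kept:
            sql += " GROUP BY " + ", ".join(kept)
        branches.append(sql)
    return "\nUNION ALL\n".join(branches)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_sales (make TEXT, year INTEGER, units INTEGER)")
conn.executemany(
    "INSERT INTO car_sales VALUES (?, ?, ?)",
    [("chevy", 1990, 5), ("chevy", 1991, 7), ("ford", 1990, 3), ("ford", 1991, 4)],
)

cube_rows = conn.execute(cube_sql("car_sales", ["make", "year"], "units")).fetchall()
for row in cube_rows:
    print(row)
```

With two makes and two years the result has 4 core rows, 2 + 2 sub-total rows, and 1 grand total. This emulation scans the table once per branch, which is the cost a native operator can avoid.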
Why This Syntax?
Abstract syntax:
select <field list> <aggregate list>
from <table expression>
where <search condition>
group by [ cube | drill down ] <aggregate list>
having <search condition>
Allows functional aggregations (e.g., sales by quarter):
select store, quarter, sum(units)
from sales
where nation = 'Mexico'
and year = 1994
group by drill down store, quarter(date) as quarter;
Decorations and Abstractions
Sometimes want to tag cube with redundant values
– region #, region_name, sales
– region name is not a dimension, it is a decoration
– Decorations are functionally dependent on dimensions
More interesting, some “dimensions” are aggregations.
– block, city, county, state, nation
Often these aggregations are not linear (are a lattice)
– second, minute, hour, day, week, month, quarter, year
– Xmas, Easter, Thanksgiving, Holiday
Rather than treat time as 12 dimensions
– Recognize abstractions as one dimension (like decorations)
– Compute efficiently (virtual functions)
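One way to read "abstractions as one dimension" is that week, month, and quarter are all functions of a single stored date. The sketch below, with hypothetical helper functions `week`, `month`, and `quarter`, also shows why the hierarchy is a lattice rather than a chain: two dates can share a week without sharing a month.

```python
from datetime import date

# Hypothetical abstraction functions over a single "date" dimension.
def month(d):
    return (d.year, d.month)

def quarter(d):
    return (d.year, (d.month - 1) // 3 + 1)

def week(d):
    iso = d.isocalendar()
    return (iso[0], iso[1])  # (ISO year, ISO week number)

d1 = date(1994, 1, 31)  # a Monday
d2 = date(1994, 2, 1)   # the Tuesday of the same week

print("same week:   ", week(d1) == week(d2))
print("same month:  ", month(d1) == month(d2))
print("same quarter:", quarter(d1) == quarter(d2))
```

Because weeks do not nest inside months, week totals cannot be rolled up into month totals; both have to be derived from days.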
Interesting Aggregate Functions
From Red Brick Systems
– Rank (in sorted order)
– N-Tile (histograms)
– Running average (cumulative functions)
– Windowed running average
– Percent of total
Users want to define their own aggregate functions
– statistics
– domain specific
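A few of the functions listed above can be sketched in plain Python to show what they return. These are simple textbook definitions written for illustration, not Red Brick's actual semantics; the function names and the sample `sales` list are made up.

```python
def running_average(values):
    """Cumulative mean after each value."""
    out, total = [], 0.0
    for i, v in enumerate(values, start=1):
        total += v
        out.append(total / i)
    return out

def windowed_average(values, window):
    """Mean of the last `window` values (fewer at the start of the list)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def percent_of_total(values):
    """Each value as a percentage of the sum of all values."""
    total = sum(values)
    return [100.0 * v / total for v in values]

sales = [10, 20, 30, 40]  # made-up numbers
print(running_average(sales))
print(windowed_average(sales, 2))
print(percent_of_total(sales))
```

Unlike sum() or max(), each of these returns one value per input row, and percent of total needs the group total before any row can be emitted.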
User Defined Aggregates
Idea:
– User function is called at start of each group
– Each function instance has scratchpad
– Function is called at end of group
Example: SUM
– START: allocates a cell and sets it to zero
– NEXT: adds next value to cell
– END: deallocates cell and returns value
– Simple example: MAX()
[Figure: a scratchpad with start, next, and end calls.]
This idea is in Illustra, IBM’s DB2/CS, and others
Needs extension for rollup and cube
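The start / next / end protocol can be sketched as a small class per aggregate plus a driver that keeps one instance per group. This is an illustrative sketch only: the class names and the `aggregate_groups` driver are invented here and are not the interface of Illustra or DB2/CS.

```python
class SumAggregate:
    def start(self):
        self.cell = 0            # scratchpad: one cell, set to zero

    def next(self, value):
        self.cell += value       # add the next value to the cell

    def end(self):
        result = self.cell
        del self.cell            # release the scratchpad
        return result

class MaxAggregate:
    def start(self):
        self.cell = None         # no value seen yet

    def next(self, value):
        if self.cell is None or value > self.cell:
            self.cell = value

    def end(self):
        result = self.cell
        del self.cell
        return result

def aggregate_groups(rows, make_aggregate):
    """rows: (group_key, value) pairs. Calls start once per group,
    next once per value, and end once per group."""
    instances = {}
    for key, value in rows:
        if key not in instances:
            agg = make_aggregate()
            agg.start()
            instances[key] = agg
        instances[key].next(value)
    return {key: agg.end() for key, agg in instances.items()}

rows = [("A", 1), ("A", 2), ("B", 5)]
print(aggregate_groups(rows, SumAggregate))
print(aggregate_groups(rows, MaxAggregate))
```

Nothing in this protocol lets one group's scratchpad feed another's, which is the gap the rollup and cube extension has to fill.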
User Defined Aggregate Function
Generalized For Cubes
Aggregates have graduated difficulty
– Distributive: can compute cube from next lower dimension values (count, min, max, ...)
– Algebraic: can compute cube from next lower scratchpads (average, ...)
– Holistic: need base data (median, mode, rank, ...)
Distributive and algebraic aggregates have a simple and efficient algorithm: build higher dimensions from the core
Holistic computation seems to require multiple passes.
– Real systems use sampling to estimate them
» (e.g., sample to find median, quartile boundaries)
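The three classes can be told apart by what has to be carried from one level to the next. In the sketch below (helper names and sample groups are assumptions for illustration), max combines from finished results, average combines from (sum, count) scratchpads, and median cannot be recovered from the per-group medians.

```python
from statistics import median

def avg_scratchpad(values):
    """Scratchpad for average: the pair (sum, count)."""
    return (sum(values), len(values))

def merge(pad_a, pad_b):
    """Combine two average scratchpads into the scratchpad of the union."""
    return (pad_a[0] + pad_b[0], pad_a[1] + pad_b[1])

def finalize(pad):
    return pad[0] / pad[1]

group_a = [1, 2, 9]   # made-up values
group_b = [4, 100]

# Distributive: the max of the union is the max of the group maxes.
max_all = max(max(group_a), max(group_b))

# Algebraic: merge the scratchpads, then finalize once.
avg_all = finalize(merge(avg_scratchpad(group_a), avg_scratchpad(group_b)))

# Holistic: the median of the group medians is not the median of the union.
median_of_medians = median([median(group_a), median(group_b)])
true_median = median(group_a + group_b)

print(max_all, avg_all, median_of_medians, true_median)
```

Averaging the two group averages directly would also give the wrong answer here, because the groups have different sizes; that is why the scratchpad keeps the count.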
How To Compute the Cube?
If attribute i has Ni values, the CUBE has Π (Ni + 1) values (product over all attributes)
Compute N-D cube with hash if it fits in RAM
Compute N-D cube with sort if it overflows RAM
Same comments apply to subcubes:
– Compute each (N-1)-D subcube from the N-D cube.
– Aggregate on “biggest” domain first when >1 deep
– Aggregate functions need hidden variables:
» e.g., average needs sum and count.
Use standard techniques from query processing
– arrays, hashing, hybrid hashing
– fall back on sorting.
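The size estimate is easy to check numerically: each attribute contributes its own values plus one extra for ALL. A minimal sketch, with `cube_size` as a hypothetical helper name:

```python
from math import prod

def cube_size(cardinalities):
    """Upper bound on cube rows: the product of (Ni + 1) over all attributes.

    The +1 accounts for the ALL value of each attribute.
    """
    return prod(n + 1 for n in cardinalities)

# 2 models x 3 years x 3 colors, as in the car sales example.
print(cube_size([2, 3, 3]))  # 3 * 4 * 4 = 48
```

Of those 48 rows, 2 x 3 x 3 = 18 form the core and the remaining 30 are super-aggregates. This is an upper bound; a sparse core produces fewer rows.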
Example:
Compute 2D core of 2 x 3 cube
Then compute 1D edges
Then compute 0D point
Works for algebraic and distributive functions
Saves “lots” of calls
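The core, edges, point order can be sketched for a 2 x 3 cube with sum as the aggregate. The function `cube_2d` and the sample rows are made up for illustration. The edges are computed from the core cells rather than from the base rows, and the point from the shorter edge.

```python
ALL = "ALL"  # stand-in token for this sketch only

def cube_2d(rows):
    """rows: (a, b, value) triples. Returns a dict keyed by (a, b),
    where either coordinate may be ALL."""
    # 2D core: one cell per (a, b) pair, built from the base rows.
    core = {}
    for a, b, v in rows:
        core[(a, b)] = core.get((a, b), 0) + v

    # 1D edges: built from core cells, not from the base rows.
    edge_a, edge_b = {}, {}
    for (a, b), s in core.items():
        edge_a[(a, ALL)] = edge_a.get((a, ALL), 0) + s
        edge_b[(ALL, b)] = edge_b.get((ALL, b), 0) + s

    # 0D point: built from the shorter edge (the bigger domain was
    # aggregated away first, leaving only two values to add).
    point = {(ALL, ALL): sum(edge_a.values())}

    return {**core, **edge_a, **edge_b, **point}

rows = [
    ("chevy", 1990, 1), ("chevy", 1991, 2), ("chevy", 1992, 3),
    ("ford", 1990, 4), ("ford", 1991, 5), ("ford", 1992, 6),
]
cube = cube_2d(rows)
for key, total in cube.items():
    print(key, total)
```

This works because sum is distributive; an algebraic function such as average would carry (sum, count) pairs through the same three steps, while a holistic one could not reuse the core this way.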
Summary
CUBE operator generalizes relational aggregates
Needs ALL value to denote sub-cubes
– ALL values represent aggregation sets
Needs generalization of user-defined aggregates
Decorations and abstractions are interesting
Computation has interesting optimizations
Relationship to “rest of SQL” not fully worked out.