[slides 2 - PPT]

OLAP & Data Cubing
Spring 2007
Nick Roussopoulos
nick@cs.umd.edu
© N. Roussopoulos 2007
OLAP-The Data Analysis Cycle
• User extracts data from database with query
• Then visualizes, analyzes data with desktop tools
[Figure: a Table extracted from the database feeds a Spread Sheet]
[Figure: storage hierarchy (Cache, Main, Secondary Disc, Online Tape, Nearline Tape, Offline Tape) in two log-scale plots against Access Time (seconds): "Size vs Speed" (Size(B)) and "Price vs Speed" ($/MB)]
Database Management Systems
The Data Cube
[Gray, Bosworth, Layman, Pirahesh ICDE 96]
• summarize multidimensional data for trend analysis
weather(time,latitude,longitude,altitude,temp,b-pressure)
• groupby with statistical functions (avg, min, max, count, sum) aggregates over table sub-groups
select avg(temp) from weather
select time, altitude from weather
group by time, altitude
• results in a new table
select location, sum(units)
from inventory
group by location
having nation = "USA";
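A minimal Python sketch of what a groupby with an aggregate does: rows are split into sub-groups by their key columns, and one output row per sub-group forms the new table. The weather rows and their values are made up for illustration; only the column names (time, altitude, temp) come from the slide.

```python
from collections import defaultdict

# Hypothetical weather rows: (time, altitude, temp). Values are made up.
weather = [
    ("t1", 0, 20.0),
    ("t1", 0, 22.0),
    ("t1", 1000, 12.0),
    ("t2", 0, 18.0),
]

def group_by_avg(rows):
    """Average temp per (time, altitude) sub-group; the result is a new table."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for time, altitude, temp in rows:
        acc = sums[(time, altitude)]
        acc[0] += temp
        acc[1] += 1
    return [(t, alt, total / n) for (t, alt), (total, n) in sorted(sums.items())]

result = group_by_avg(weather)
```

Each output row carries the grouping columns plus the aggregate, so the result can itself be queried like any other table.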
[Figure: a Table whose rows are grouped by attribute value (A, B, C, D); SUM() produces one result row per group]
An Example
SALES
Model  Year  Color  Sales
Chevy  1990  red    5
Chevy  1990  white  87
Chevy  1990  blue   62
Chevy  1991  red    54
Chevy  1991  white  95
Chevy  1991  blue   49
Chevy  1992  red    31
Chevy  1992  white  54
Chevy  1992  blue   71
Ford   1990  red    64
Ford   1990  white  62
Ford   1990  blue   63
Ford   1991  red    52
Ford   1991  white  9
Ford   1991  blue   55
Ford   1992  red    27
Ford   1992  white  62
Ford   1992  blue   39
CUBE
DATA CUBE
Model  Year  Color  Sales
ALL    ALL   ALL    942
chevy  ALL   ALL    510
ford   ALL   ALL    432
ALL    1990  ALL    343
ALL    1991  ALL    314
ALL    1992  ALL    285
ALL    ALL   red    165
ALL    ALL   white  273
ALL    ALL   blue   339
chevy  1990  ALL    154
chevy  1991  ALL    199
chevy  1992  ALL    157
ford   1990  ALL    189
ford   1991  ALL    116
ford   1992  ALL    128
chevy  ALL   red    91
chevy  ALL   white  236
chevy  ALL   blue   183
ford   ALL   red    144
ford   ALL   white  133
ford   ALL   blue   156
ALL    1990  red    69
ALL    1990  white  149
ALL    1990  blue   125
ALL    1991  red    107
ALL    1991  white  104
ALL    1991  blue   104
ALL    1992  red    59
ALL    1992  white  116
ALL    1992  blue   110
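A small Python sketch of how such a cube can be derived from the base rows: every row contributes its measure to each combination of kept and ALL-ed dimensions. It uses only the Ford 1990 and Ford 1991 rows of the SALES table; the function name data_cube and the ALL marker string are illustrative choices, not part of the slides.

```python
from itertools import product

ALL = "ALL"

# A subset of the SALES rows from the slide: (model, year, color, sales).
sales = [
    ("Ford", 1990, "red", 64),
    ("Ford", 1990, "white", 62),
    ("Ford", 1990, "blue", 63),
    ("Ford", 1991, "red", 52),
    ("Ford", 1991, "white", 9),
    ("Ford", 1991, "blue", 55),
]

def data_cube(rows, n_dims):
    """Sum the measure for every combination of kept and ALL-ed dimensions."""
    cube = {}
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Each mask decides, per dimension, whether to keep the value or use ALL.
        for mask in product((False, True), repeat=n_dims):
            key = tuple(ALL if hide else d for d, hide in zip(dims, mask))
            cube[key] = cube.get(key, 0) + measure
    return cube

cube = data_cube(sales, 3)
```

With these six rows, cube[("Ford", 1990, ALL)] is 189 and cube[("Ford", 1991, ALL)] is 116, the same values the DATA CUBE table lists for ford 1990 ALL and ford 1991 ALL.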
Division of labor
Computation vs Visualization
• Relational system builds CUBE relation
 aggregation best done close to data
 filtering of data is possible
 Cube computation may be recursive
(e.g., percent of total, quartile, ....)
• Visualization System displays/explores the cube
[Figure: 3-D chart of cube values by year (1990, 1991, 1992, ALL) and color, on a 0 to 200 scale banded 0-50, 50-100, 100-150, 150-200]
Problems with SQL Groupbys
• Histograms (aggregation over computed categories)
GROUP BY CUBE
F() G() H()
Problems with SQL Groupbys
• drill-down and roll-up
Not relational
(null values in the keys)
More problems with Groupbys
• roll-up is asymmetric (e.g., does not aggregate by year or by color alone)
• cross-tabulation (spreadsheets)
• even if SQL syntax can be devised, a 6D cross-tab requires 64 groupby
queries to generate it and 64 scans and sorts of the data
 most of these are not relational expressions but are in many report
writers
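The count of 64 is just the number of subsets of six dimensions. A short Python sketch that enumerates them; the dimension names d1..d6 are placeholders.

```python
from itertools import combinations

def all_groupbys(dims):
    """Every subset of the dimensions, from the full group-by down to the grand total."""
    return [combo
            for k in range(len(dims), -1, -1)
            for combo in combinations(dims, k)]

dims = ["d1", "d2", "d3", "d4", "d5", "d6"]  # hypothetical dimension names
groupbys = all_groupbys(dims)
print(len(groupbys))  # one group-by per subset of the six dimensions
```

Issuing each of these as a separate query is what leads to one scan and sort of the data per group-by.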
CUBE:
A Relational Aggregate Operator Generalizing Group By
[Figure: progression from a single Aggregate (Sum), to a Group By with total (By Color: RED, WHITE, BLUE, Sum), to a Cross Tab (By Make: Chevy, Ford; By Color: RED, WHITE, BLUE; Sum), to The Data Cube and The Sub-Space Aggregates (By Make, By Year, By Color, By Make & Year, By Make & Color, By Color & Year, Sum)]
Idea: N-dimensional Cube
Each Attribute is a Dimension
• N-dimensional Aggregate (sum(), max(),...)
 fits relational model exactly:
a1, a2, ...., aN, f(*)
• Super-aggregate over N-1 Dimensional sub-cubes
ALL, a2, ...., aN, f(*)
a1, ALL, a3, ...., aN, f(*)
...
a1, a2, ...., ALL, f(*)
 this is the N-1 Dimensional cross-tab.
• Super-aggregate over N-2 Dimensional sub-cubes
ALL, ALL, a3, ...., aN, f(*)
...
a1, a2, ...., ALL, ALL, f(*)
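The row templates for each level can be generated mechanically: pick k of the N attributes and replace them with ALL. A Python sketch for N = 3; the helper name subcube_templates is made up for this illustration.

```python
from itertools import combinations

def subcube_templates(attrs, k):
    """Row templates of the (N-k)-dimensional sub-cubes: k attributes replaced by ALL."""
    templates = []
    for hidden in combinations(range(len(attrs)), k):
        templates.append(
            tuple("ALL" if i in hidden else a for i, a in enumerate(attrs))
        )
    return templates

attrs = ("a1", "a2", "a3")
level1 = subcube_templates(attrs, 1)  # the N-1 dimensional sub-cubes
level2 = subcube_templates(attrs, 2)  # the N-2 dimensional sub-cubes
```

There are C(N, k) templates at level k, which sums to 2^N over all levels.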
Summary of the Cube
• CUBE operator generalizes relational aggregates
• Needs ALL value to denote sub-cubes
 ALL values represent aggregation sets
• Needs generalization of user-defined aggregates
• Decorations and abstractions are interesting
• Computation has interesting optimizations
• Relationship to “rest of SQL” not fully worked out.
Computing the (full) Cube
• Discussion from: “Computation of Multi-dimensional Aggregates”; SIGMOD 1996
• Options:
 One SQL query for each group by on the original data
 Use one group by to compute another
• e.g., Group-by on (A, B) can be computed from group-bys on (A, B, C) or (A, B, D) etc…
 Compute multiple group-by’s simultaneously
• Overlap Method (main contribution of the above paper)
• More specifically:
 For each group-by, can use “sorting” or “hashing”
• Say (A, B):
• Sorting: Sort the relation first by A, then by B; make a scan and compute the aggregates one by one
• Hashing: Maintain the aggregates (|A| * |B|) in memory; scan the relation once and update them appropriately
 What if we want to compute (A, B) from (A, B, C)?
• If the group-by on (A, B, C) is already sorted, we can do it very efficiently
• Records (a1, b1, c1, aggr), (a1, b1, c2, aggr) … are sequential, so we need just one tuple of memory
 What if we want to compute (A, B) from (A, C, B)?
• This is trickier since all tuples needed to compute (a1, b1, aggr) are not contiguous
• Need memory equal to |B| tuples
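The memory argument can be illustrated with a Python sketch, assuming sum as the aggregate and small in-memory lists standing in for sorted group-by results. When the parent is sorted on (A, B, C), one running tuple is enough; when it is sorted on (A, C, B), partial sums for up to |B| values of B have to be kept per value of A. Function names and sample rows are illustrative only.

```python
from itertools import groupby

def rollup_sorted_prefix(abc_rows):
    """Parent sorted by (a, b, c): stream it, holding one running (a, b) total."""
    out = []
    current_key, running = None, 0
    for a, b, c, aggr in abc_rows:
        if (a, b) != current_key:
            if current_key is not None:
                out.append((*current_key, running))
            current_key, running = (a, b), 0
        running += aggr
    if current_key is not None:
        out.append((*current_key, running))
    return out

def rollup_non_prefix(acb_rows):
    """Parent sorted by (a, c, b): rows of one (a, b) are scattered,
    so keep up to |B| partial sums while scanning each run of equal a."""
    out = []
    for a, run in groupby(acb_rows, key=lambda r: r[0]):
        partial = {}  # b -> partial sum, at most |B| entries
        for _, c, b, aggr in run:
            partial[b] = partial.get(b, 0) + aggr
        out.extend((a, b, total) for b, total in sorted(partial.items()))
    return out

abc = [("a1", "b1", "c1", 1), ("a1", "b1", "c2", 2),
       ("a1", "b2", "c1", 3), ("a2", "b1", "c1", 4)]
acb = [("a1", "c1", "b1", 1), ("a1", "c1", "b2", 3),
       ("a1", "c2", "b1", 2), ("a2", "c1", "b1", 4)]
ab_from_abc = rollup_sorted_prefix(abc)
ab_from_acb = rollup_non_prefix(acb)
```

Both calls produce the same (A, B) group-by; only the amount of state held during the scan differs.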
Computing the (full) Cube
• An example of memory requirements and actual computation
Cube={Materialized Views}
[Harinarayan, Rajaraman, Ullman 96]
• each groupby creates a “summary table” which is a
materialized view with some dressing
• storing these summary tables speeds up cube queries
• what to store and what not
• TPC-D example for sale analysis
The Lattice Organization
• the query sales groupby part can be answered from
 p - cost of scanning 0.2M records
 pc - cost of scanning 6.0M records
 psc - cost of scanning 6.0M records
• select the views that minimize overall query cost
 need a good query model
 need a good optimization criterion
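A Python sketch of answering a query from the lattice under the scan-cost view taken here: a view can answer a query if it contains all of the query's grouping attributes, and the cost is the number of records in the view. The three sizes are the ones on the slide; treating each view name as a set of one-letter attributes is an assumption made for the illustration.

```python
# Record counts for three views, as given on the slide.
view_rows = {"p": 200_000, "pc": 6_000_000, "psc": 6_000_000}

def can_answer(view, query_dims):
    """A view answers a query if it keeps every attribute the query groups by."""
    return set(query_dims) <= set(view)

def cheapest_view(query_dims, materialized):
    """Among the materialized views that can answer the query, pick the smallest."""
    candidates = {v: view_rows[v] for v in materialized if can_answer(v, query_dims)}
    return min(candidates.items(), key=lambda item: item[1])

only_top = cheapest_view("p", ["psc"])
with_p = cheapest_view("p", ["psc", "pc", "p"])
```

With only psc materialized the groupby on part scans 6.0M records; once p is materialized the same query scans 0.2M.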
Views grow exponentially
• in general 2**N subspaces
[Figure: lattice of the 16 group-by subspaces over dimensions A, B, C, D: ABCD; ABC, ABD, ACD, BCD; AB, AC, AD, BC, BD, CD; A, B, C, D; none]
Greedy Allocation Algorithm
• optimization criterion:
 storage S (total capacity)
 query model (query frequencies to all views)
 find the best views to materialize
• linear cost model:
 cost of answering Q from a materialized view A generated by QA
(Ancestor) is the size of the table A
 cost of accessing part of a view is equal to cost of accessing all the view
• for each view v in a subset of views S
 C(v) is the storage cost
 B(v,S) is the benefit of v wrt S
 for each w <= v (w is covered by v):
• u is the min cost view in S s.t. w <= u
• if C(v) < C(u) then B_w = C(u) - C(v) else B_w = 0
 $B(v,S) = \sum_{w \le v} B_w$
 B(v,S) computes the benefit of v by considering how much it helps the other views that it covers: if the cost of answering through v is better than v’s competitors, then it adds this to the total benefit of v
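A compact Python sketch of the greedy selection under the linear cost model. Two assumptions are made for the illustration: the budget is a number k of extra views rather than a storage capacity, and the view sizes for a three-dimensional lattice over A, B, C are made up. The top view ABC is taken as always materialized.

```python
def covers(v, w):
    """w <= v: every dimension of w is also in v, so w can be answered from v."""
    return set(w) <= set(v)

def benefit(v, selected, size):
    """B(v, S): total saving over all views w covered by v."""
    total = 0
    for w in size:
        if not covers(v, w):
            continue
        # u: cheapest already-selected view that can answer w
        cheapest = min(size[u] for u in selected if covers(u, w))
        if size[v] < cheapest:
            total += cheapest - size[v]
    return total

def greedy_select(size, top, k):
    """Start from the top view, then add up to k views of maximum benefit."""
    selected = [top]
    for _ in range(k):
        candidates = [v for v in size if v not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda v: benefit(v, selected, size))
        if benefit(best, selected, size) <= 0:
            break
        selected.append(best)
    return selected

# Hypothetical view sizes; "" is the grand total (group by none).
size = {"ABC": 100, "AB": 50, "AC": 75, "BC": 20, "A": 30, "B": 10, "C": 15, "": 1}
chosen = greedy_select(size, "ABC", 2)
```

With these sizes the first pick is BC (benefit 320) and the second is AB (benefit 100): after BC is in, small views such as B or C add little because BC already answers them cheaply.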
Greedy Allocation Algorithm
• Can be shown to have an approximation ratio of (1 – 1/e) = 0.63…
 Seems to be a straightforward application of the submodularity of the objective, but this was not checked carefully
• Issues:
 Dealing with hierarchies for roll-ups and drill-downs
• Can be incorporated w/o much trouble
 Cost of using a view is assumed to be linear
• Usually not true
• We may want to build indexes on the views
• Addressed in a later paper by Gupta, Harinarayan, Rajaraman, and Ullman,
1997
 Did not look at “refresh time”
• Even with batch updates, we still have limited time to do the updates
DynaMat
Yannis Kotidis, Nick Roussopoulos (SIGMOD 1999)
• Conventional Data Warehouse
 pre-computed set of views is static (too hard to select and adjust)
 usually selected by an administrator
• DynaMat proposed a framework for automatic
management of views
 Unifies view selection & view refresh
 Amortizes generation and maintenance cost over multiple uses of
cached results
• Techniques
 DynaMat caches the results of every query
 Each incoming query is evaluated against the cached results to see
if any of those can be used
 The captured set is updated within an update cycle to the extent
possible
DynaMat Architecture
Online Operation
• Try to answer each query from the view pool (Fragment Locator)
 Fragments are either single-value predicates or complete ranges
 A Directory Index is maintained for efficient searches
• Decide on the fly whether to cache the result in the pool (Admission Control Entity)
Materialized Range Fragments
• Each dimension d_i of a materialized result is restricted to one of
a) a full range R_i = {min_d, max_d}
b) a single value for d_i
c) an empty range, which denotes a dimension that is not present in the query
• SQL queries are mapped to MR queries that are answered by cached MRFs
• MRFs are coarser than query results (expanded when necessary)
• No combination of MRFs is used to answer a query (more costly, especially when MRFs are too small and/or overlap)
• An R-tree based index is used to identify possible MRFs that can answer the query; among those, the best fit is chosen
• The use of MRFs makes matching efficient.
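A rough Python sketch of fragment matching. The subsumption rule used here is an assumption made for the illustration, not taken from the slides: a cached full range can serve any request on that dimension, while a cached single value or absent dimension must match the query exactly. A linear scan stands in for the R-tree index, and "best fit" is read as the smallest usable fragment.

```python
FULL = "FULL"    # full range of the dimension
ABSENT = None    # dimension not present in the query (aggregated away)

def dim_can_answer(cached, wanted):
    """Assumed rule: FULL serves anything; otherwise the two must be identical."""
    if cached == FULL:
        return True
    return cached == wanted

def can_answer(fragment, query):
    return all(dim_can_answer(f, q) for f, q in zip(fragment, query))

def best_fit(pool, query):
    """pool maps a fragment (one entry per dimension) to its stored row count."""
    usable = [f for f in pool if can_answer(f, query)]
    return min(usable, key=lambda f: pool[f]) if usable else None

# Hypothetical pool over three dimensions, with made-up sizes.
pool = {
    (FULL, FULL, FULL): 1000,
    (FULL, "s1", ABSENT): 40,
    ("p7", FULL, FULL): 60,
}
hit_exact = best_fit(pool, (FULL, "s1", ABSENT))
hit_coarser = best_fit(pool, ("p7", ABSENT, ABSENT))
```

The second lookup shows a coarser fragment answering a query: ("p7", FULL, FULL) is chosen over the full cube because it is smaller, and the result would then be aggregated down to what the query asked for.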
Storage Structures & Construction of Cubes
• Subcubes vs. Full cubes
 Subcube selection
• Cost of construction and indexing
• Maintenance
Cubetrees
[Roussopoulos, Kotidis, Roussopoulos 97]
• better storage organization needed
 materialized views and indexes are not different
 single storage organization for both
• bulk load techniques are very important
 rates should be on the order of GB/hour (industrial strength)
• incremental bulk update is the MOST important issue
• we had lots of experience with spatial access methods:
mainly with all possible variations of R-trees (handy)
• packed R-trees
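A minimal Python sketch of the idea behind sort-based packing, covering the leaf level only: sort the points once, then fill each leaf to capacity and record its bounding box. Plain lexicographic sorting is an assumption chosen for brevity, not necessarily the order used by Cubetrees, and the sample points are made up.

```python
def pack_leaves(points, capacity):
    """Sort the points once, fill leaf pages to capacity, and record each page's MBR."""
    ordered = sorted(points)
    leaves = []
    for start in range(0, len(ordered), capacity):
        page = ordered[start:start + capacity]
        n_dims = len(page[0])
        lows = tuple(min(p[d] for p in page) for d in range(n_dims))
        highs = tuple(max(p[d] for p in page) for d in range(n_dims))
        leaves.append({"mbr": (lows, highs), "points": page})
    return leaves

points = [(3, 1), (1, 2), (2, 2), (1, 1), (3, 3)]  # hypothetical 2-d points
leaves = pack_leaves(points, capacity=2)
```

Because the input is written out in one sorted pass, every leaf except possibly the last is full, which is what makes bulk loading much cheaper than inserting one point at a time.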
Extended Data Cube Model
[Figure: Table R(A,B,C,Q) drawn in the 3-d space with axes A, B, C. A relation tuple T(a,b,c,q) and its groupby projections: (a,b,0,q) for groupby(A,B), (a,0,c,q) for groupby(A,C), (0,b,c,q) for groupby(B,C), (a,0,0,q) for groupby(A), (0,b,0,q) for groupby(B), (0,0,c,q) for groupby(C), and the origin for groupby(none)]
 relation tuples: points in the N-d space
 groupby projections: also points
 point data is very efficient for multidimensional indexing
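A short Python sketch of the mapping from a fact tuple to its projection points: each coordinate is either kept or set to 0, giving 2^N points per tuple. It assumes real coordinate values are non-zero so that 0 can be reserved for an aggregated-away dimension; the sample values are made up.

```python
from itertools import product

def projection_points(coords, q):
    """All 2^N points a fact tuple maps to: each coordinate kept or set to 0."""
    points = []
    for keep in product((True, False), repeat=len(coords)):
        point = tuple(c if k else 0 for c, k in zip(coords, keep))
        points.append(point + (q,))
    return points

pts = projection_points((4, 7, 2), 10)  # hypothetical tuple T(a=4, b=7, c=2, q=10)
```

The first point is the tuple itself and the last is the groupby(none) point at the origin; in a real cube the q values of tuples that land on the same projection point would be aggregated.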
Dataless Cubetree
• separate the fact table (relation) points
• keep only the aggregate projection points in the
cubetree to reduce the size
[Figure: the same 3-d space for Table R(A,B,C,Q) with the fact table points removed; only the groupby projection points remain]
BIGGEST CHALLENGE:
In-place Update problem
C
• each record in the fact table may update an exponential number of other points (in 3-d, 2^3 = 8 points)
• record-at-a-time updates:
 are too expensive in terms of I/O
 destroy clustering of data points
 kill indexes
 are the main reason for SIR < 2%
[Figure: a new tuple T’(a,b,c’,q’) arrives next to T(a,b,c,q) in the A, B, C space. Projections that drop C are updated in place from q to q+q’: (a,b,0), (a,0,0), (0,b,0), (0,0,0). Projections that keep C gain new points: (a,0,c’,q’), (0,b,c’,q’), (0,0,c’,q’)]
Future of OLAP & Data Cubing
• The big rush to data warehousing has passed and left a bitter taste
 Too costly
 Did not achieve the promised database integration
• New applications with multi-dimensional data are needed
 Cost with today’s technology is much less
 Data integration is not easier and requires hard brain work
• Promising data areas
 Scientific
 Security
 Web