Optimal Workload-Based Weighted Wavelet Synopsis

Yossi Matias
Daniel Urieli
School of Computer Science
Tel Aviv University
Outline
Motivation
 Background & Contributions
 Wavelet synopses
 Optimal WB weighted wavelet synopses

Approximate Query Processing
Operational database (GB/TB): SQL query -> exact answer. Long response times!
Compact data synopses (KB/MB): "transformed" query -> approximate answer. FAST!!
Goals

Develop data synopses
  Most accurate answers
  Using a small amount of memory

Massive data sets
  Time- and I/O-efficient construction
Data synopses

Samples: random samples, stratified samples, congressional samples, reservoir sampling, backing samples, join synopses, sketches
  [Olken-Rotem, Vitter, Alon-Matias-Szegedy, Gibbons-Matias-Poosala, Acharya et al…]
  Used in commercial DB systems

Histograms: equi-depth, compressed, v-optimal, spline, multidimensional, dynamic, Max-diff, MHIST
  [Poosala-Ioannidis, etc.]
  Used in commercial DB systems

Wavelet synopses: basic, multi-dim, probabilistic, dynamic, extended
  Adapt to the nature of the data effectively
  [Matias-Vitter-Wang, Garofalakis-Gibbons, Chakrabarti et al, Roussopoulos-Kotidis…]

Workload-based wavelet synopses [Matias, Portman]

Accuracy of various synopses
Workload-based synopses

Future queries are correlated to past queries
  Can be thought of as taken from a probability distribution roughly determined by the workload

Workload-based synopses: optimized for a given query workload
  "Standard" synopses assume a uniform workload
Workload-based synopses – prior work

Workload-based sampling
  Overcoming limitations of sampling for aggregation queries [Chaudhuri, Das, Datar, Motwani, and Narasayya]
  Icicles: Self-tuning samples for approximate query answering [Ganti, Lee, Ramakrishnan]

Workload-based histograms
  Self-tuning histograms [Aboulnaga and Chaudhuri]
  ST-holes [Bruno et al.]
  Hierarchical range histogram [Guha-Koudas-Srivastava-02]
Workload-based wavelets

Workload-based wavelet synopses, by Yossi Matias and Leon Portman [MP03]
  Adapts effectively to a given query workload (not only to the data)
  Reduces the mean squared absolute or relative error over a workload of queries
  Order-of-magnitude improvement over prior wavelet synopses
  Not necessarily optimal
Contributions

Optimal Workload-based Weighted Wavelet (WWW) synopses
  WB-MSE (Workload-Based Mean Squared Error)
  WB-MRE (Workload-Based Mean-squared Relative Error)
  Equivalently, minimize the expected squared absolute or relative error over a point query

First to minimize the MRE over the data
  WB-MRE with uniform distribution

Both WWW synopses are optimal enhanced wavelet synopses
  A generalized definition which allows coefficients with arbitrary values

Optimal cost construction
  Linear construction time
  I/O optimal
Techniques

Problem definition in terms of
  Weighted norm
  Weighted inner product
  Weighted inner-product space

Weighted wavelets for building data synopses
Haar wavelet decomposition

Wavelets: mathematical tool for hierarchical decomposition of functions/signals
Haar wavelets: simplest wavelet basis, easy to understand and implement
  Recursive pairwise averaging and differencing at different resolutions
  A linear-time algorithm
Resolution   Averages                    Detail Coefficients
3            [2, 2, 0, 2, 3, 5, 4, 4]    ---
2            [2, 1, 4, 4]                [0, -1, -1, 0]
1            [1.5, 4]                    [0.5, 0]
0            [2.75]                      [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
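
The recursive pairwise averaging and differencing can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function names `haar_decompose` and `haar_reconstruct` are made up here, coefficients are left unnormalized, and the input length is assumed to be a power of two.

```python
def haar_decompose(data):
    """Unnormalized Haar decomposition (assumes len(data) is a power of two).

    Returns [overall average, detail coefficients...], ordered from the
    coarsest resolution to the finest.
    """
    averages = list(data)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Detail = half the difference of each pair; finer levels go last.
        details = [(a - b) / 2 for a, b in pairs] + details
        # Average of each pair becomes the next, coarser resolution.
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose: left child = avg + detail, right = avg - detail."""
    values, pos = coeffs[:1], 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(values)]
        pos += len(level)
        values = [v for avg, det in zip(values, level)
                  for v in (avg + det, avg - det)]
    return values
```

For the data in the table, `haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])` returns `[2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]`, and `haar_reconstruct` maps that list back to the original values. Each input value is touched a constant number of times per level it survives to, which gives the linear running time.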
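Wavelet error tree [MVW98]
[Figure: error tree for the example. The root holds the overall average 2.75, followed by the detail coefficient -1.25, then 0.5 and 0, then 0, -1, -1, 0. The leaves hold the original data 2, 2, 0, 2, 3, 5, 4, 4. Edges are labeled + (left child) and - (right child).]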
The Haar Basis
[Figure: the Haar basis vectors, with entries +1, -1, and 0.]
Wavelet error tree [MVW98]
How should we choose which coefficients to retain?
[Figure: the error tree of the example, with root 2.75, detail coefficients -1.25, 0.5, 0, 0, -1, -1, 0, and the original data at the leaves.]
Parseval-based optimal thresholding

Given a vector $v \in \mathbb{R}^N$, represented with respect to some orthonormal basis
Goal: approximate the vector using only $M \ll N$ basis coefficients
Then, choosing the $M$ largest coefficients (in absolute value) is optimal
  Minimizes the $L_2$ norm of the error vector $E$, and equivalently the mean squared error:
  $\mathrm{MSE}(E) = \sum_{i=0}^{N-1} \frac{e_i^2}{N}$
Haar Wavelet Synopses - summary

Compute the Haar wavelet decomposition of D
Coefficient thresholding: only M << |D| = N coefficients can be kept
  Parseval-based thresholding is optimal w.r.t. the MSE
  Several other greedy heuristics exist
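Example
Standard thresholding

Given a synopsis S:
  $\mathrm{MSE}(S) = \sum_{i=0}^{N-1} \frac{e(Q_i)^2}{N} = 0.25$
  $\mathrm{WL2}(S) = \sum_{i=0}^{N-1} c_i \cdot e(Q_i)^2 = 0.498$, where $0 \le c_i \le 1$
Normalization: each wavelet coefficient is divided by $\sqrt{2^{\mathrm{level}}}$, where level is the coefficient's level in the error tree.

[Figure: error tree with coefficients 3.5, -0.5, -1, 0, 0, -2, -1, 0; the normalized detail coefficients are -0.5, -0.707, 0, 0, -1, -0.5, 0.]

Original data         2      2      2      6      3      5      4      4
Standard              2      2      2      6      4      4      4      4
Workload importance   0.001  0.001  0.001  0.001  0.249  0.249  0.249  0.249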
Example
Workload-based thresholding

Given a synopsis S:
  $\mathrm{MSE}(S) = \sum_{i=0}^{N-1} \frac{e(Q_i)^2}{N} = 1$
  $\mathrm{WL2}(S) = \sum_{i=0}^{N-1} c_i \cdot e(Q_i)^2 = 0.008$, where $0 \le c_i \le 1$

[Figure: error tree with coefficients 3.5, -0.5, -1, 0, 0, -2, -1, 0; the normalized detail coefficients are -0.5, -0.707, 0, 0, -1, -0.5, 0.]

Original data         2      2      2      6      3      5      4      4
Standard              2      2      2      6      4      4      4      4
Workload-based        2      2      4      4      3      5      4      4
Workload importance   0.001  0.001  0.001  0.001  0.249  0.249  0.249  0.249
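
The numbers in the two example slides can be checked directly. The following Python sketch recomputes MSE and WL2 for the standard and the workload-based approximations; the function and variable names are illustrative only, and the data, approximations, and workload values are the ones shown on the slides.

```python
def mse(data, approx):
    """Mean squared error over all point queries."""
    return sum((d - a) ** 2 for d, a in zip(data, approx)) / len(data)


def wl2(data, approx, workload):
    """Workload-weighted sum of squared point-query errors."""
    return sum(c * (d - a) ** 2 for c, d, a in zip(workload, data, approx))


data = [2, 2, 2, 6, 3, 5, 4, 4]
workload = [0.001] * 4 + [0.249] * 4          # query importance per position
standard = [2, 2, 2, 6, 4, 4, 4, 4]           # standard thresholding
workload_based = [2, 2, 4, 4, 3, 5, 4, 4]     # workload-based thresholding

standard_scores = (mse(data, standard), wl2(data, standard, workload))
workload_scores = (mse(data, workload_based), wl2(data, workload_based, workload))
```

Standard thresholding gives MSE 0.25 and WL2 0.498, while workload-based thresholding gives MSE 1 and WL2 of about 0.008. The workload-based synopsis accepts a larger unweighted error in exchange for a much smaller error on the queries that are actually likely to be asked.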
Error definition

$D = (d_1, \ldots, d_N)$: our data
$q_i$: the point query "$d_i = ?$"
$\hat{d}_i$: the approximated answer
abs-error: $e_i = |d_i - \hat{d}_i|$    rel-error: $e_i = \frac{|d_i - \hat{d}_i|}{|d_i|}$
The purpose: reduce a norm of $E = (e_1, \ldots, e_N)$
  For example: $\mathrm{MSE}(E) = \sum_{i=1}^{N} \frac{e_i^2}{N}$
Workload-based Error

A workload: $(c_1, \ldots, c_N)$, where $c_i$ is the probability that $q_i$ appears.
Given a workload $W = (c_1, \ldots, c_N)$ we define the Weighted L2 Norm:
  $\mathrm{WL2}(E) = \sum_{i=1}^{N} c_i e_i^2$ for $E = (e_1, \ldots, e_N)$
When $c_i = 1/N$: $\mathrm{WL2}(E) = \mathrm{MSE}(E)$
Our goal

Minimizing the WL2 norm of the error vector E
  For a given data set D and query workload W
Equivalently: minimizing the expected squared error over a point query taken from a given distribution
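
Written out, the equivalence is a one-line expectation. Here $I$ denotes the index of a random point query drawn with $\Pr[I = i] = c_i$; the symbol $I$ is introduced only for this illustration and does not appear in the slides.

```latex
\[
\mathbb{E}\left[e_I^2\right]
  = \sum_{i=1}^{N} \Pr[I = i]\, e_i^2
  = \sum_{i=1}^{N} c_i\, e_i^2
  = \mathrm{WL2}(E)
\]
```

With the uniform workload $c_i = 1/N$ the right-hand side reduces to $\mathrm{MSE}(E)$.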
Regular Haar transform
Given a data set D = (d0,…,dN-1)
D -> Haar Transform (HT) -> HT(D) -> standard thresholding -> wavelet synopsis
Overview
Given a data set D = (d0,…,dN-1) and a workload vector W = (c0,…,cN-1)
W -> Weighted Haar Basis (WHB) -> WHB(W)
D, WHB(W) -> Weighted Haar Transform (WHT) -> WHT(D) -> standard thresholding -> WB wavelet synopsis
Uses: Parseval's formula, the WL2 norm, the weighted inner product, and the algorithm for computing the WH basis from the workload
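The weighted Haar basis

The Weighted Haar Basis would also look like the Haar basis, but with values x and -y.
[Figure: a basis function on [0, 1] taking the values x and -y.]

Compute the Weighted Haar Basis

Recall the weight coefficients (the workload) W = (c0,…,cN-1) for D = (d0,…,dN-1)
Meaning it would look more like:
[Figure: a basis function on [0, 1] whose heights depend on the weights c0, c1, …, cN-1.]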
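
The slides do not spell out how the weighted Haar basis is computed, so the following Python sketch is only one plausible construction under explicit assumptions. It assumes the weighted inner product is $\langle u, v \rangle_W = \sum_i c_i u_i v_i$, that every weight is positive, that N is a power of two, and that each basis vector takes the value x on the left half of its support and -y on the right half, with x and y chosen to make the vectors orthonormal under that inner product. All function names are hypothetical.

```python
import math


def weighted_inner(u, v, weights):
    """Weighted inner product <u, v>_W = sum(c_i * u_i * v_i)."""
    return sum(c * a * b for c, a, b in zip(weights, u, v))


def weighted_haar_basis(weights):
    """Vectors orthonormal under weighted_inner (assumed construction)."""
    n = len(weights)
    basis = [[1.0 / math.sqrt(sum(weights))] * n]  # weighted "average" vector
    size = n
    while size >= 2:
        half = size // 2
        for start in range(0, n, size):
            w_left = sum(weights[start:start + half])
            w_right = sum(weights[start + half:start + size])
            # Choose x, y so that x*w_left == y*w_right (orthogonal to
            # constants) and x^2*w_left + y^2*w_right == 1 (unit norm).
            x = math.sqrt(w_right / (w_left * (w_left + w_right)))
            y = math.sqrt(w_left / (w_right * (w_left + w_right)))
            vec = [0.0] * n
            vec[start:start + half] = [x] * half
            vec[start + half:start + size] = [-y] * half
            basis.append(vec)
        size = half
    return basis


def weighted_synopsis(data, weights, m):
    """Keep the m largest weighted-Haar coefficients; return the
    approximation and the energy of the dropped coefficients."""
    basis = weighted_haar_basis(weights)
    coeffs = [weighted_inner(data, b, weights) for b in basis]
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:m])
    approx = [sum(coeffs[i] * basis[i][j] for i in keep) for j in range(len(data))]
    dropped = sum(c ** 2 for i, c in enumerate(coeffs) if i not in keep)
    return approx, dropped
```

Because the basis is orthonormal under the weighted inner product, Parseval's formula carries over: the WL2 error of the approximation returned by `weighted_synopsis` equals the energy of the dropped coefficients, so ordinary largest-coefficient thresholding minimizes WL2 among synopses that keep M coefficients of this basis. With uniform weights the construction gives x = y and the vectors are a scaled copy of the regular Haar basis.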
Experimental results
[Figure: WB-MSE vs. standard]
[Figure: WB-MRE, adaptive, standard]
[Figure: WB-MRE, adaptive]
Thank you!