Optimal Workload-Based Weighted Wavelet Synopsis

Yossi Matias, Daniel Urieli
School of Computer Science, Tel Aviv University

Outline
- Motivation
- Background & Contributions
- Wavelet synopses
- Optimal WB weighted wavelet synopses

Approximate Query Processing
- Exact path: an SQL query against the operational database (GB/TB) returns an exact answer. Long response times!
- Approximate path: a "transformed" query against compact data synopses (KB/MB) returns an approximate answer. FAST!!

Goals
- Develop data synopses
  - Most accurate answers
  - Using a small amount of memory
- Massive data sets
  - Time- and I/O-efficient construction

Data synopses
- Samples: random samples, stratified samples, congressional samples, reservoir sampling, backing samples, join synopses, sketches [Olken-Rotem, Vitter, Alon-Matias-Szegedy, Gibbons-Matias-Poosala, Acharya et al. …]. Used in commercial DB systems.
- Histograms: equi-depth, compressed, v-optimal, spline, multidimensional, dynamic, Max-diff, MHIST [Poosala-Ioannidis, etc.]. Used in commercial DB systems.
- Wavelet synopses: basic, multi-dim, probabilistic, dynamic, extended [Matias-Vitter-Wang, Garofalakis-Gibbons, Chakrabarti et al., Roussopoulos-Kotidis …]. Adapt to the nature of the data effectively.
- Workload-based wavelet synopses [Matias, Portman]

Accuracy of various synopses

Workload-based synopses
- Future queries are correlated to past queries
  - They can be thought of as drawn from a probability distribution roughly determined by the workload
- Workload-based synopses: optimized for a given query workload
- "Standard" synopses assume a uniform workload

Workload-based synopses: prior work
- Workload-based sampling
  - Overcoming limitations of sampling for aggregation queries [Chaudhuri, Das, Datar, Motwani, and Narasayya]
  - Icicles: Self-tuning samples for approximate query answering [Ganti, Lee, Ramakrishnan]
- Workload-based histograms
  - Self-tuning histograms [Aboulnaga and Chaudhuri]
  - ST-holes [Bruno et al.]
  - Hierarchical range histogram [Guha-Koudas-Srivastava-02]

Workload-based wavelets
- Workload-Based Wavelet Synopses [MP03], by Yossi Matias and Leon Portman
  - Adapts effectively to a given query workload (not only to the data)
  - Reduces the mean-squared absolute / relative error over a workload of queries
  - Order-of-magnitude improvement over prior wavelet synopses
  - Not necessarily optimal

Contributions
- Optimal Workload-based Weighted Wavelet (WWW) synopses
  - WB-MSE (Workload-Based Mean Squared Error)
  - WB-MRE (Workload-Based Mean-squared Relative Error)
- Equivalently, minimize the expected squared (absolute or relative) error over a point query
- First to minimize the MRE over the data
  - WB-MRE with uniform distribution
- Both WWW synopses are optimal enhanced wavelet synopses
  - A generalized definition which allows coefficients with arbitrary values
- Optimal cost construction
  - Linear construction time
  - I/O optimal

Techniques
- Problem definition in terms of
  - Weighted norm
  - Weighted inner product
  - Weighted inner-product space
- Weighted wavelets for building data synopses

Haar wavelet decomposition
- Wavelets: a mathematical tool for hierarchical decomposition of functions/signals
- Haar wavelets: the simplest wavelet basis, easy to understand and implement
  - Recursive pairwise averaging and differencing at different resolutions
  - A linear-time algorithm

Resolution   Averages                   Detail coefficients
3            [2, 2, 0, 2, 3, 5, 4, 4]   ---
2            [2, 1, 4, 4]               [0, -1, -1, 0]
1            [1.5, 4]                   [0.5, 0]
0            [2.75]                     [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]

Wavelet error tree [MVW98]
- Figure: the coefficients 2.75, -1.25, 0.5, 0, 0, -1, -1, 0 arranged as a tree with + and - edges, whose leaves are the original data [2, 2, 0, 2, 3, 5, 4, 4]

The Haar Basis
- Figure: the Haar basis vectors, with entries +1, -1 and 0

Wavelet error tree [MVW98]
- How should we choose which coefficients to retain?

Parseval-based optimal thresholding
- Given a vector $v \in R^N$ expressed in some orthonormal basis
- Goal: approximate the vector using only M << N basis coefficients
- Then choosing the M largest coefficients (in absolute value) is optimal
  - Minimizes the L2 norm of the error vector E, and hence $MSE(E) = \frac{1}{N} \sum_{i=0}^{N-1} e_i^2$

Haar wavelet synopses: summary
- Compute the Haar wavelet decomposition of D
- Coefficient thresholding: only M << |D| = N coefficients can be kept
  - Parseval-based thresholding is optimal w.r.t. the MSE
  - Several other greedy heuristics exist

Example: standard thresholding
- Given a synopsis S:
  - $MSE(S) = \frac{1}{N} \sum_{i=0}^{N-1} e(Q_i)^2 = 0.25$
  - $WL2(S) = \sum_{i=0}^{N-1} c_i \cdot e(Q_i)^2 = 0.498$, where $0 \le c_i \le 1$ and $\sum_i c_i = 1$
- Normalization: each wavelet coefficient is divided by $\sqrt{2^{level}}$, where level is its level in the error tree
- Figure: error tree with coefficients 3.5, -0.5, -1, 0, 0, -2, -1, 0 and their normalized values

Original data:         2      2      2      6      3      5      4      4
Standard synopsis:     2      2      2      6      4      4      4      4
Workload importance:   0.001  0.001  0.001  0.001  0.249  0.249  0.249  0.249
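The decomposition and thresholding slides above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names (`haar_decompose`, `haar_reconstruct`, `threshold`) and the budget M = 4 are assumptions made for the example, and ties between equally significant coefficients are broken by index, which the slides do not specify.

```python
import math


def haar_decompose(data):
    """Unnormalized Haar transform by pairwise averaging and differencing.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(data)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Coarser details go in front of the finer ones computed earlier.
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose: each parent a with detail d yields a + d, a - d."""
    values = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        level_details = coeffs[pos:pos + len(values)]
        values = [v for a, d in zip(values, level_details) for v in (a + d, a - d)]
        pos += len(level_details)
    return values


def level(i):
    """Error-tree level of coefficient i (0 for the average and the top detail)."""
    return 0 if i == 0 else i.bit_length() - 1


def threshold(coeffs, m):
    """Keep the m coefficients that are largest after normalization (assumed
    here to be division by sqrt(2**level)); zero out the rest."""
    ranked = sorted(
        range(len(coeffs)),
        key=lambda i: abs(coeffs[i]) / math.sqrt(2 ** level(i)),
        reverse=True,
    )
    keep = set(ranked[:m])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


if __name__ == "__main__":
    print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
    data = [2, 2, 2, 6, 3, 5, 4, 4]
    synopsis = threshold(haar_decompose(data), 4)  # M = 4 is an assumption
    print(haar_reconstruct(synopsis))
```

With the assumed budget of four coefficients, the second call reconstructs 2, 2, 2, 6, 4, 4, 4, 4, which agrees with the standard synopsis shown in the example.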
Example: workload-based thresholding
- Given a synopsis S:
  - $MSE(S) = \frac{1}{N} \sum_{i=0}^{N-1} e(Q_i)^2 = 1$
  - $WL2(S) = \sum_{i=0}^{N-1} c_i \cdot e(Q_i)^2 = 0.008$, where $0 \le c_i \le 1$ and $\sum_i c_i = 1$
- Figure: error tree with coefficients 3.5, -0.5, -1, 0, 0, -2, -1, 0 and their normalized values

Original data:             2      2      2      6      3      5      4      4
Standard synopsis:         2      2      2      6      4      4      4      4
Workload-based synopsis:   2      2      4      4      3      5      4      4
Workload importance:       0.001  0.001  0.001  0.001  0.249  0.249  0.249  0.249

Error definition
- $D = (d_1, \dots, d_N)$: our data
- $q_i$: the point query "$d_i = ?$"
- $\hat{d}_i$: the approximated answer
- abs-error: $e_i = |d_i - \hat{d}_i|$
- rel-error: $e_i = |d_i - \hat{d}_i| / |d_i|$
- The purpose: reduce a norm of $E = (e_1, \dots, e_N)$, for example $MSE(E) = \frac{1}{N} \sum_{i=1}^{N} e_i^2$

Workload-based error
- A workload: $(c_1, \dots, c_N)$, where $c_i$ is the probability that $q_i$ appears
- Given a workload $W = (c_1, \dots, c_N)$ we define the weighted L2 norm: $WL2(E) = \sum_{i=1}^{N} c_i e_i^2$ for $E = (e_1, \dots, e_N)$
- When $c_i = 1/N$: WL2(E) = MSE(E)

Our goal
- Minimize the WL2 norm of the error vector E
  - For a given data set D and query workload W
- Equivalently: minimize the expected squared error over a point query drawn from a given distribution

Regular Haar transform
- Given a data set $D = (d_0, \dots, d_{N-1})$
- D -> Haar Transform (HT) -> HT(D) -> standard thresholding -> wavelet synopsis

Overview
- Given a data set $D = (d_0, \dots, d_{N-1})$ and a workload vector $W = (c_0, \dots, c_{N-1})$
- W -> Weighted Haar Basis, WHB(W)
- D, WHB(W) -> Weighted Haar Transform (WHT) -> WHT(D) -> standard thresholding -> WB wavelet synopsis
- Ingredients: Parseval's formula, the WL2 norm, the weighted inner product, and the algorithm for computing the WH basis from the workload

The weighted Haar basis
- Recall the weight coefficients (the workload) $W = (c_0, \dots, c_{N-1})$ for $D = (d_0, \dots, d_{N-1})$
- Compute the weighted Haar basis from $c_0, c_1, \dots, c_{N-1}$
- Figure: a weighted Haar basis function on [0, 1] looks like a Haar basis function, but takes the values x and -y on the two halves of its support

Experimental results
- WB-MSE vs. standard (figure)
- WB-MRE, adaptive, standard (figure)
- WB-MRE, adaptive (figure)

Thank you!
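As a sketch of the error definitions above, the following Python computes the weighted L2 norm and re-derives the numbers from the two thresholding examples. The function names are hypothetical. The `weighted_inner` helper assumes the weighted inner product is the sum of c_i * u_i * v_i, which the slides name but do not spell out, so treat it as an assumption rather than the authors' definition.

```python
import math


def weighted_inner(u, v, workload):
    """Assumed weighted inner product: sum of c_i * u_i * v_i."""
    return sum(c * a * b for a, b, c in zip(u, v, workload))


def errors(data, approx, relative=False):
    """Per-point-query errors: |d - d_hat|, or |d - d_hat| / |d| if relative."""
    if relative:
        return [abs(d - a) / abs(d) for d, a in zip(data, approx)]
    return [abs(d - a) for d, a in zip(data, approx)]


def wl2(data, approx, workload, relative=False):
    """Weighted L2 norm as defined on the slides: sum of c_i * e_i**2."""
    e = errors(data, approx, relative)
    return weighted_inner(e, e, workload)


def mse(data, approx):
    """MSE is the special case of WL2 with the uniform workload c_i = 1/N."""
    n = len(data)
    return wl2(data, approx, [1 / n] * n)


if __name__ == "__main__":
    data = [2, 2, 2, 6, 3, 5, 4, 4]
    workload = [0.001] * 4 + [0.249] * 4
    standard = [2, 2, 2, 6, 4, 4, 4, 4]
    workload_based = [2, 2, 4, 4, 3, 5, 4, 4]
    for name, approx in (("standard", standard), ("workload-based", workload_based)):
        print(name, mse(data, approx), wl2(data, approx, workload))
```

The standard synopsis has the lower MSE (0.25 against 1), while the workload-based synopsis has the lower WL2 (0.008 against 0.498), because its errors fall on the queries the workload rarely asks.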
