Optimal Workload-Based Weighted Wavelet Synopsis
Yossi Matias, Daniel Urieli
School of Computer Science, Tel Aviv University

Outline
  Motivation
  Background & Contributions
  Wavelet synopses
  Optimal WB weighted wavelet synopses

Approximate Query Processing
  SQL Query -> Operational Database (GB/TB) -> Exact Answer: Long Response Times!
  “Transformed” Query -> Compact Data Synopses (KB/MB) -> Approximate Answer: FAST!!

Goals
  Develop data synopses
    Most accurate answers
    Using a small amount of memory
  Massive data sets
    Time- and I/O-efficient construction

Data synopses
  Samples: random samples, stratified samples, congressional samples, reservoir-sampling, backing samples, join synopses, sketches
    [Olken-Rotem, Vitter, Alon-Matias-Szegedy, Gibbons-Matias-Poosala, Acharya et al…]
  Histograms: equi-depth, compressed, v-optimal, spline, multidimensional, dynamic, Max-diff, MHIST
    Used in commercial DB systems [Poosala-Ioannidis, etc.]
  Wavelet synopses: basic, multi-dim, probabilistic, dynamic, extended
    Adapts to nature of data effectively
    [Matias-Vitter-Wang, Garofalakis-Gibbons, Chakrabarti et al, Roussopoulos-Kotidis…]
  Workload-based wavelet synopses [Matias, Portman]

Accuracy of various synopses

Workload-based synopses
  Future queries are correlated to past queries
    Can be thought of as drawn from a probability distribution roughly determined by the workload
  Workload-based synopses: optimized for a given query workload
  “Standard” synopses assume a uniform workload

Workload-based synopses – prior work
  Workload-based sampling
    Overcoming limitations of sampling for aggregation queries [Chaudhuri, Das, Datar, Motwani, and Narasayya]
    Icicles: Self-tuning samples for approximate query answering [Ganti, Lee, Ramakrishnan]
  Workload-based histograms
    Self-tuning histograms [Aboulnaga and Chaudhuri]
    ST-holes [Bruno et al.]
    Hierarchical range histogram [Guha-Koudas-Srivastava-02]

Workload-based wavelets
  Workload-Based Wavelet Synopses [MP03], by Yossi Matias and Leon Portman
    Adapts effectively to a given query workload (not only to data)
    Reduces the mean-squared absolute / relative error over a workload of queries
    Order-of-magnitude improvement over prior wavelet synopses
    Not necessarily optimal

Contributions
  Optimal Workload-based Weighted Wavelet (WWW) synopses
    WB-MSE (Workload-Based Mean Squared Error)
    WB-MRE (Workload-Based Mean-squared Relative Error)
    Equivalently, minimize the expected squared absolute or relative error over a point query
  First to minimize the MRE over the data
    WB-MRE with uniform distribution
  Both WWW synopses are optimal enhanced wavelet synopses
    A generalized definition which allows coefficients with arbitrary values
  Optimal cost construction
    Linear construction time
    I/O optimal

Techniques
  Problem definition in terms of
    Weighted norm
    Weighted inner product
    Weighted inner-product space
  Weighted wavelets for building data synopses

Haar wavelet decomposition
  Wavelets: mathematical tool for hierarchical decomposition of functions/signals
  Haar wavelets: simplest wavelet basis, easy to understand and implement
    Recursive pairwise averaging and differencing at different resolutions
    A linear-time algorithm

  Resolution   Averages                    Detail coefficients
  3            [2, 2, 0, 2, 3, 5, 4, 4]    ---
  2            [2, 1, 4, 4]                [0, -1, -1, 0]
  1            [1.5, 4]                    [0.5, 0]
  0            [2.75]                      [-1.25]

  Wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]

Wavelet error tree [MVW98]
  [Figure: error tree with root 2.75, below it -1.25, then 0.5 and 0, then 0, -1, -1, 0, over the original data 2, 2, 0, 2, 3, 5, 4, 4; edges are labeled + and -]

The Haar Basis
  [Figure: the Haar basis functions]

Wavelet error tree [MVW98]
  How should we choose which coefficients to retain?

Parseval-based optimal thresholding
  Given a vector $v \in \mathbb{R}^N$ represented with respect to some orthonormal basis
  Goal: approximate the vector using only $M \ll N$ basis coefficients
  Then, choosing the $M$ largest coefficients (in absolute value) is optimal
    Minimizes the $L_2$ norm of the error vector $E$, and hence $\mathrm{MSE}(E) = \frac{1}{N}\sum_{i=0}^{N-1} e_i^2$

Haar Wavelet Synopses - summary
  Compute the Haar wavelet decomposition of D
  Coefficient thresholding: only M << |D| = N coefficients can be kept
    Parseval-based thresholding is optimal w.r.t. the MSE
    Several other greedy heuristics exist

Example: standard thresholding
  Given a synopsis S:
    $\mathrm{MSE}(S) = \frac{1}{N}\sum_{i=0}^{N-1} e(Q_i)^2 = 0.25$
    $\mathrm{WL}_2(S) = \sum_{i=0}^{N-1} c_i\, e(Q_i)^2 = 0.498$, where $0 \le c_i \le 1$ and $\sum_i c_i = 1$
  Normalization: each wavelet coefficient is divided by $\sqrt{2^{\mathrm{level}}}$, where level is the coefficient's level in the error tree
  [Figure: error tree with coefficients 3.5; -0.5; -1, 0; 0, -2, -1, 0 and normalized values 3.5; -0.5; -0.707, 0; 0, -1, -0.5, 0]
  Original data:        2      2      2      6      3      5      4      4
  Standard:             2      2      2      6      4      4      4      4
  Workload importance:  0.001  0.001  0.001  0.001  0.249  0.249  0.249  0.249

Example: workload-based thresholding
  Given a synopsis S:
    $\mathrm{MSE}(S) = \frac{1}{N}\sum_{i=0}^{N-1} e(Q_i)^2 = 1$
    $\mathrm{WL}_2(S) = \sum_{i=0}^{N-1} c_i\, e(Q_i)^2$, where $0 \le c_i \le 1$ and $\sum_i c_i = 1$
  [Figure: error tree with coefficients 3.5; -0.5; -1, 0; 0, -2, -1, 0 and normalized values 3.5; -0.5; -0.707, 0; 0, -1, -0.5, 0]
  Original data:        2      2      2      6      3      5      4      4
  Standard:             2      2      2      6      4      4      4      4
  Workload-based:       2      2      4      4      3      5      4      4
  Workload importance:  0.001  0.001  0.001  0.001  0.249  0.249  0.249  0.249
  For the workload-based synopsis: $\mathrm{WL}_2(S) = 0.008$

Error definition
  $D = (d_1, \dots, d_N)$: our data
  $q_i$: the point query "$d_i$ = ?"
  $\hat{d}_i$: the approximated answer
  abs-error: $e_i = |d_i - \hat{d}_i|$
  rel-error: $e_i = \frac{|d_i - \hat{d}_i|}{|d_i|}$
  The purpose: reduce a norm of $E = (e_1, \dots, e_N)$
    For example: $\mathrm{MSE}(E) = \frac{1}{N}\sum_{i=1}^{N} e_i^2$

Workload-based Error
  A workload: $(c_1, \dots, c_N)$, where $c_i$ is the probability that $q_i$ appears
  Given a workload $W = (c_1, \dots, c_N)$, we define the weighted $L_2$ norm:
    $\mathrm{WL}_2(E) = \sum_{i=1}^{N} c_i e_i^2$ for $E = (e_1, \dots, e_N)$
  When $c_i = 1/N$: $\mathrm{WL}_2(E) = \mathrm{MSE}$

Our goal
  Minimizing the WL2 norm of the error vector E
    For a given data set D and query workload W
  Equivalently: minimizing the expected squared error over a point query taken from a given distribution

Regular Haar transform
  Given a data set D = (d0,…,dN-1)
  D -> Haar Transform (HT) -> HT(D) -> standard thresholding -> wavelet synopsis

Overview
  Given a data set D = (d0,…,dN-1) and a workload vector W = (c0,…,cN-1)
  W -> Weighted Haar Basis: WHB(W)
  D -> Weighted Haar Transform (WHT), using WHB(W) -> WHT(D) -> standard thresholding -> WB-wavelet synopsis

Parseval’s formula, the WL2 norm, the weighted inner product, and the algorithm for computing the WH basis from the workload

The weighted Haar basis
  The weighted Haar basis would also look like the Haar basis, but with heights x and -y [figure on the interval from 0 to 1]
  Compute the Weighted Haar Basis
    Recall the weight coefficients (the workload) W = (c0,…,cN-1) for D = (d0,…,dN-1)
    Meaning it would look more like: [figure on the interval from 0 to 1, with weights c0, c1, …, cN-1]

Experimental results
  WB-MSE vs. STANDARD

Experimental results
  WB-MRE, ADAPTIVE, STANDARD

Experimental results
  WB-MRE, ADAPTIVE

Thank you!
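
Worked sketch
  The Haar decomposition by pairwise averaging and differencing, and the MSE and WL2 error measures from the two examples, can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the function names haar_decompose, mse and wl2 are hypothetical, the input length is assumed to be a power of two, and WL2 is taken as the weighted sum of squared errors, as in the examples. The approximated answers are copied from the example slides rather than computed by thresholding.

```python
def haar_decompose(data):
    """Unnormalized Haar decomposition (assumes len(data) is a power of two).

    Returns the overall average followed by the detail coefficients,
    ordered from the coarsest resolution to the finest.
    """
    averages = list(data)
    levels = []  # detail coefficients per resolution, finest first
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        levels.append([(a - b) / 2 for a, b in pairs])  # pairwise differencing
        averages = [(a + b) / 2 for a, b in pairs]      # pairwise averaging
    # Flatten once at the end so the total work stays linear in len(data).
    return averages + [d for level in reversed(levels) for d in level]


def mse(data, approx):
    """Mean squared error over all point queries."""
    return sum((d - a) ** 2 for d, a in zip(data, approx)) / len(data)


def wl2(data, approx, workload):
    """Workload-weighted sum of squared point-query errors."""
    return sum(c * (d - a) ** 2 for c, d, a in zip(workload, data, approx))


# Data set from the Haar decomposition slide.
print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]

# Data, workload and approximated answers from the two example slides.
data = [2, 2, 2, 6, 3, 5, 4, 4]
workload = [0.001] * 4 + [0.249] * 4
standard = [2, 2, 2, 6, 4, 4, 4, 4]
workload_based = [2, 2, 4, 4, 3, 5, 4, 4]

print(round(mse(data, standard), 3), round(wl2(data, standard, workload), 3))
# -> 0.25 0.498
print(round(mse(data, workload_based), 3), round(wl2(data, workload_based, workload), 3))
# -> 1.0 0.008
```

  The standard synopsis has the lower MSE, while the workload-based synopsis has the lower WL2, because its errors fall on the queries with importance 0.001.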