Offline, Stream and Approximation Algorithms for Synopsis Construction
Sudipto Guha, University of Pennsylvania
Kyuseok Shim, Seoul National University
A tutorial on synopsis construction algorithms, VLDB 2005

About this Tutorial
- Information is incomplete and could be inaccurate; our presentation reflects our understanding, which may be erroneous.

Synopsis Construction
"Where is the life we have lost in living? Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?" (T. S. Eliot, from The Rock)
- Routers, sensors, the web, astronomy and the sciences: too much data, too little time.

The Idea
"To see the world in a grain of sand..."
- Broad characteristics of the data.
- Compression, dimensionality reduction.
- Approximate query answering.
- Denoising, outlier detection, and a broad array of signal processing.

What Is a Synopsis?
- Hmm. Any "shorthand" representation: clustering! SVD!
- In this tutorial we will focus on signal/time-series processing.

The Basic Problem
- Formally, given a signal X and a dictionary {φ_i}, find a representation F = Σ_i z_i φ_i with at most B non-zero z_i, minimizing some error that is a function of X − F.
- Note: the above extends to any dimension.

Many Issues
- What is the dictionary? Which B terms? What is the error? What are the constraints?

Many Issues: The Dictionary
- A set of vectors; maybe a basis (then we can keep the top-k coefficients).
- E.g., the Haar wavelets; also Fourier, polynomials, ...
Many Issues: The Dictionary (continued)
- The dictionary may not be a basis. Histograms: there are n-choose-2 interval vectors, but since we impose a non-overlapping restriction we get a unique representation.

Many Issues: Which B Terms?
- First B? Best B? Why should we choose the first B?
  1. Storing B vs. 2B numbers (the best B coefficients also need their indices).
  2. Also ...

Approximation Theory
- The discipline of mathematics concerned with approximating functions; the same as our problem.
- Linear theory (Parseval, ~1800, over two centuries old); non-linear theory (Schmidt 1909, Haar 1910).
- Is it relevant? Yes. However, the mathematical treatment has been "extremal", i.e., how does the error change as a function of B, and is that bound tight?
- Note: a yes answer does not say anything about "given this signal, is this the best we can do?"

Many Issues: The Error
- This controls which B terms we choose.
- ||X−F||_2 is most common, used all over mathematics; ||X−F||_1 and ||X−F||_∞ are useful as well.
- Weights: relative error matters. Approximating 1000 by 1010 is not so bad; approximating 1 by 11 is not too good an idea.

Many Issues: The Constraints
- Input? A stream, a stream of updates, ...
- Space, time, precision and range of values (for the z_i in the expression F = Σ_i z_i φ_i).

In This Tutorial
- Histograms and wavelets.
- We will focus on optimal, approximation and streaming algorithms, and how to get one from the other.
- Connections to top-k and Fourier.

I. Histograms
V-Opt Histograms
- Let's start simple: given a signal X, find a piecewise-constant representation H with at most B pieces minimizing ||X−H||_2 [Jagadish, Koudas, Muthukrishnan, Poosala, Sevcik, Suel, 1998].
- Consider one bucket: the mean is the best value.
- A natural dynamic-programming formulation.

An Example Histogram

Data distribution:

Location (i):  1   2   3   4   5   6   7
Value (x_i):  12  10   2   8  14  28  16

V-optimal histogram:

Range:          [1,4]  [5,5]  [6,6]  [7,7]
Representative:     8     14     28     16

Idea: V-Opt Algorithm
- Within a step/bucket, the mean is best.
- Assume the last bucket is [j+1, n]. What can we say about the remaining k−1 buckets? They must also be optimal for the range [1, j] with k−1 buckets! The total cost is OPT[j, k−1] + SQERR[j+1, n]. Dynamic programming!!
- The dynamic program constructs the V-optimal histogram:
  OPT[n, k] = min_{1≤j<n} { OPT[j, k−1] + SQERR[(j+1)..n] }
- OPT[j, k]: the minimum cost of representing the set of values indexed by [1..j] by a histogram with k buckets.
- SQERR[(j+1)..n]: the sum of squared errors from positions j+1 to n.
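The recurrence above can be sketched as follows. This is our own minimal code, not the authors'; it uses the SUM/SQSUM prefix arrays (introduced on the following slides) so that each bucket-error query costs O(1), giving the O(Bn²) running time.

```python
# Minimal sketch of the O(Bn^2) V-Opt dynamic program (names are ours).
def vopt(x, B):
    """Minimum squared error of a B-bucket piecewise-constant histogram for x."""
    n = len(x)
    # Prefix arrays: SUM[i] = x_1+...+x_i, SQSUM[i] = x_1^2+...+x_i^2.
    SUM = [0.0] * (n + 1)
    SQSUM = [0.0] * (n + 1)
    for i, v in enumerate(x):
        SUM[i + 1] = SUM[i] + v
        SQSUM[i + 1] = SQSUM[i] + v * v

    def sqerr(i, j):  # squared error of one bucket over x[i..j], 0-indexed inclusive
        s = SUM[j + 1] - SUM[i]
        sq = SQSUM[j + 1] - SQSUM[i]
        return sq - s * s / (j - i + 1)  # sum x^2 - (sum x)^2 / width

    INF = float("inf")
    # OPT[k][i]: best error for the prefix x[0..i-1] using k buckets.
    OPT = [[INF] * (n + 1) for _ in range(B + 1)]
    OPT[0][0] = 0.0
    for k in range(1, B + 1):
        for i in range(1, n + 1):
            for j in range(k - 1, i):  # last bucket covers x[j..i-1]
                c = OPT[k - 1][j] + sqerr(j, i - 1)
                if c < OPT[k][i]:
                    OPT[k][i] = c
    return OPT[B][n]
```

On the slide's example, `vopt([12, 10, 2, 8, 14, 28, 16], 4)` yields 56.0: the single non-trivial bucket [1,4] has mean 8 and error 16+4+36+0.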
The DP-based V-Opt Algorithm

for i = 1 to n do
  for k = 1 to B do
    for j = 1 to i−1 do   (split point between the (k−1)-bucket histogram and the last bucket)
      OPT[i, k] = min{ OPT[i, k], OPT[j, k−1] + SQERR[j+1, i] }

- We need O(Bn) entries for the table OPT.
- Each entry OPT[i, k] takes O(n) time, provided SQERR[j+1, i] can be computed in O(1) time.
- O(Bn) space and O(Bn²) time.

Computing the Sum of Squared Errors in O(1) Time

index  1  2  3  4
x      2  3  7  5
sum    2  5 12 17

sum(2,3) = x[2] + x[3] = sum[3] − sum[1] = 12 − 2 = 10

Let SQSUM[1, i] = Σ_{p=1}^{i} x_p² and SUM[1, i] = Σ_{p=1}^{i} x_p. Then

  SQSUM(i, j) = Σ_{p=i}^{j} x_p² = SQSUM[1, j] − SQSUM[1, i−1]
  SUM(i, j)   = Σ_{p=i}^{j} x_p  = SUM[1, j] − SUM[1, i−1]

Thus

  SQERR[i, j] = Σ_{p=i}^{j} (x_p − x̄)²
              = Σ_{p=i}^{j} x_p² − (Σ_{p=i}^{j} x_p)² / (j − i + 1)
              = (SQSUM[1, j] − SQSUM[1, i−1]) − (SUM[1, j] − SUM[1, i−1])² / (j − i + 1)

Analysis of the V-Opt Algorithm
- O(n²B) time, O(nB) space. The space can be reduced (Wednesday).
- Main question: the end use of a histogram is to approximate something. Why not find an "approximately optimal" (e.g., (1+ε)) histogram?

If You Had to Improve Something?
- Exact: O(n²B) time, O(nB) space; or O(n²B) time, O(n) space.
- Via wavelets (ssq): O(n) time, O(B²/ε²) space.
- (1+ε) streaming: O(nB²/ε) time, O(B²/ε) space.
- (1+ε) streaming: O(n) time, O(B²/ε) space; offline: O(n) time, O(B²/ε) space.
- (1+ε) streaming (ssq): O(n) time, O(B/ε²) space; offline: O(n) time,
O(n + B/ε) space.

Take 1

for i = 1 to n do
  for k = 1 to B do
    for j = 1 to i−1 do   (split point for the last bucket)
      OPT[1..i, k] = min{ OPT[1..i, k], OPT[1..j, k−1] + SQERR(j+1, i) }

- As j increases, OPT[1..j, k−1] is increasing and SQERR(j+1, i) is decreasing.
- Question: can we use the monotonicity to search for the minimum?

No
- Consider a sequence of positive y_1, y_2, ..., y_n. Let F(i) = Σ_{j≤i} y_j and G(i) = F(n) − F(i−1).
- F(i) is monotonically increasing (like OPT[1..j, k−1]); G(i) is monotonically decreasing (like SQERR(j+1, i)).
- Ω(n) time is necessary to find min_i { F(i) + G(i) }.
- Open question: does this extend to Ω(n²) over the entire algorithm?

What Gives?
- With F and G as above, F(i) + G(i) = F(n) + y_i.
- So any i gives a 2-approximation to min_i { F(i) + G(i) }: F(i) + G(i) = F(n) + y_i ≤ 2 F(n), while min_i { F(i) + G(i) } is at least F(n).

Round 1
- Use a histogram to approximate the function. Bootstrap!
- Approximate the increasing function in powers of (1+δ): the right endpoint of each interval is a (1+δ)-approximation of the left endpoint.

What Does That Do?
- Consider evaluating the function at the two endpoints. Proof by picture: the value at the right endpoint is at most (1+δ) times the value at any interior point. Why? By construction (the geometric intervals) and by monotonicity!

Therefore...
- The right-hand endpoint is a (1+δ)-approximation, and this holds for any point in between:
  OPT[x] + SQERR[x+1] ≥ OPT[a] + SQERR[b] ≥ OPT[b]/(1+δ) + SQERR[b] ≥ (OPT[b] + SQERR[b]) / (1+δ)
- Are we done? Not quite yet. What happens for B > 2? We do not compute OPT[i, b] exactly!!
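The interval construction just described can be sketched as follows (our own code and naming): keep only the breakpoints of a non-decreasing positive sequence at which the value grows by more than a (1+δ) factor, so any point is (1+δ)-approximated by the last kept breakpoint before it, and the number of kept breakpoints is logarithmic in the value range.

```python
# Minimal sketch of the (1+delta) geometric-breakpoint compression.
def compress(a, delta):
    """Indices at which the non-decreasing positive sequence a grows by
    more than a (1+delta) factor over the last kept breakpoint."""
    kept = [0]
    for i in range(1, len(a)):
        if a[i] > (1 + delta) * a[kept[-1]]:
            kept.append(i)
    return kept
```

For a = [1, 1.05, 1.2, 2, 4, 8] and δ = 0.5 this keeps indices [0, 3, 4, 5]: every dropped value is within a factor (1+δ) of the kept breakpoint preceding it.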
Zen and the Art of Histograms
- Approximate the increasing function in powers of (1+δ); the right endpoint is a (1+δ)-approximation.
- Prove by induction that the accumulated error is (1+δ)^B.
- This tells us what δ should be (small): in fact, if we set δ = ε/(2B) then (1+δ)^B ≤ 1+ε.

Complexity Analysis
- Number of intervals p ≈ (B/ε) log n. Why? c(1+δ)^(p−1) ≤ nR² and δ = ε/(2B), where R is the largest number in the data; assume R is polynomially bounded in n.
- Running time ≈ nB · (B/ε) log n.
- Why are we approximating the increasing function and not the decreasing one?

The First Streaming Model
- The signal X is specified by the x_i arriving in increasing order of i.
- Not the most general model, but extremely useful for modeling time-series data.

Streaming
- For each interval [a, b] we need to store Σ_{i≤a} x_i, Σ_{i≤a} x_i², Σ_{i≤b} x_i, Σ_{i≤b} x_i².
- Required space is (B²/ε) log n.

V-Opt Construction: O(Bn²) [Jagadish et al., VLDB 1998]
- OPT(i, k) = min_{1≤j<i} { OPT(j, k−1) + SQERR(j+1, i) }.
- (Figure: the full n × n table of OPT[j, k] vs. OPT[j, k−1] entries.)

AHIST-S: (1+ε) Approximation
- AOPT[i, k] = min over interval endpoints b_jp of { AOPT[b_jp, k−1] + SQERR[b_jp + 1, i] }.
- δ = ε/(2B); each interval [a, b] satisfies (1+δ) AOPT[a, k] ≥ AOPT[b, k] but (1+δ) AOPT[a, k] < AOPT[b+1, k].
- O(B²ε⁻¹ n log n) time and O(B²ε⁻¹ log n) space; each of the B lists has P = O(Bε⁻¹ log n) intervals.

The Overall Idea
- (Figure: the natural DP table vs. the approximate, compressed table.)

Do the εs Talk to Us?
- (Figure: execution time vs. B on DJIA data, 1901-1993.)

Take 2: GK02, Sliding-Window Streams
- Potentially infinite data; we are interested in the last n values only.
- Q: suppose we constructed the histogram for [1..n] and now want it for [2..(n+1)].
- The previous idea is dead on arrival.
Consider 100, 1, 2, 3, 4, 5, 7, 8, ... (the expensive first element affects every prefix solution).

Formal Problem
- Maintain a data structure such that, given an interval [a, b], we can construct a B-bucket histogram for [a, b] on the fly.
- Generalizes the sliding window! Generalizes V-Opt when a = 1, b = n.

Reconsider Take 1
- We were evaluating the table left to right, but we are still evaluating every entry!

A Brave New World
- Assume an O(n)-size buffer holds the x_i values; run the previous algorithm on it.
- Several issues: (1) which values are necessary and sufficient? (2) we are not evaluating all values, so what does the induction run over?

GK02: Enhanced (1+ε) Approximation
- Lazy evaluation using binary search: O(B³ε⁻² log³ n) time and O(n) space.
- Pre-processing takes O(n) time (the SUM and SQSUM arrays).
- Creates all B interval lists at once. The necessary AOPT[j, k] values are computed recursively to find the intervals [a_jp, b_jp], where b_jp is the largest z such that
  AOPT[z, k] ≤ (1+δ) AOPT[a_jp, k] < AOPT[z+1, k].
- Note that AOPT increases as z increases, so we can use binary search to find z.
- The O(n)-space SUM and SQSUM arrays must be maintained to allow computation of SQERR(j+1, i) in O(1) time.
- O(n + B³ε⁻² log³ n) time and O(n) space; P = O(Bε⁻¹ log n) intervals per list.

Take 2 Summary
- O(n) space and O(n + B³ε⁻² log³ n) time. Is that the best? Obviously no.

Take 3: AHIST-L-Δ
- Suppose we knew λ ≤ OPT ≤ 2λ. Then, instead of powers of (1+ε/B), use additive terms of λε/(2B).
- The time becomes O(B³ε⁻² log n), with O(B/ε) breakpoints per list.
- To get λ?
- A 2-approximation gives a ratio ρ = O(1); a binary search over it costs O(log n). Thus O(B³ log n · log n).
- Overall: O(n + B³(ε⁻² + log n) log n) time and O(n + B²/ε) space.

Take 4: AHIST-B
- Consider the Take 3 algorithm. How to stream it? Run it on the new part of the data (a buffer of size M) and combine.

Not Done Yet
- First find a ρ = O(1) approximation, then proceed back and refine (from k down to k−1, ..., within factor 1+ρ).

The Running Space-Time
- Time: B · (# insertions) · (log M) · (log ℓ), where ℓ = O(Bε⁻¹ log n) is the length of a list.
- Space: who cares, and why?

Asymptotics
- For fixed B and ε, we can compute a (1+ε) piecewise-constant representation in O(n log log n) time and O(log n) space, or in O(n) time and O(log n log log n) space.
- Extends to degree-d polynomials: space increases by O(d) and time is O(nd + d³·...).

Experiments
- (Figures: running time vs. B; (Error − VOPT)/VOPT vs. B; execution time vs. n. What you analyze is what you get.)

Questions?

For a General Error Measure, IF...
- The error of a bucket depends only on the values in the bucket.
- The overall error function is the sum of the errors of the buckets.
- The data can be processed in O(T) time per item such that, storing O(P) information, we can find the error of a bucket in O(Q) time.
- The error (of a bucket) is a monotonic function of the interval.
- The maximum and the minimum non-zero error are polynomially bounded in n.
Then...
- Optimum histogram in O(nT + n²(B+Q)) time and O(n(P+B)) space.
- (1+ε)-approximation in O(nT + nQB²ε⁻¹ log n) time and O(PB²ε⁻¹ log n) space; or O(nT + QB³(log n + ε⁻²) log n) time and O(nP) space; or O(nT) time and space O(PB²ε⁻¹ log n + (QB/T)[Bε⁻¹ log²(Bε⁻¹ log n) + log n log log n]).

Splines and Piecewise Polynomials
- (Figures: instead of piecewise-constant pieces we may want piecewise-linear pieces, or splines.)

The Overall Idea
- If we want to represent {x_{a+1}, ..., x_b} by p_0 + p_1 (x − x_a) + p_2 (x − x_a)² + ..., the solution is as above.
- We need O(d) times more space (than before) and must solve a linear system; this means an increase by a factor O(d³) in time.

Another Useful Example: Relative Error
- Issue with global measures: estimating 10 by 20 and 1000 by 1010 has the same effect.
- The above is OK if we are querying for "1000" a thousand times and ten times for "10" (point queries and the VOPT measure). But consider approximating a time series: we may be interested in per-point guarantees.

Sum of Squared Relative Error for a Bucket
- Relative error for a bucket (s_r, e_r, x_r):
  ERRSQ(s_r, e_r) = min_{x_r} Σ_{i=s_r}^{e_r} (x_i − x_r)² / max{c, |x_i|}² = A x_r² − 2B x_r + C,
  where A = Σ 1/max{c,|x_i|}², B = Σ x_i/max{c,|x_i|}², C = Σ x_i²/max{c,|x_i|}².
- Since A > 0, it is minimized at x_r = B/A; the minimum value is C − B²/A.
- If the aggregated sums A, B and C are stored (as prefix sums), ERRSQ(i, j) can be computed in O(1) time.
- The optimal histogram can be constructed in O(Bn²) time; the approximation algorithms follow.

Maximum Error and the l1 Metric

Maximum-Error Histograms
- A bucket (s_r, e_r, x_r) over numbers {x_1, x_2, ..., x_n}, where
s_r is the starting position, e_r the ending position, and x_r the representative value.
- Maximum error: ERR_M(s_r, e_r) = min_{x_r} max_{i∈[s_r,e_r]} |x_i − x_r|.
- Maximum relative error: ERR_M(s_r, e_r) = min_{x_r} max_{i∈[s_r,e_r]} |x_i − x_r| / max{c, |x_i|}.

Maximum Error of a Bucket
- Given numbers {x_1, x_2, ..., x_n}, the maximum error is ERR_M = min_{x_r} max_i |x_i − x_r|.
- What is the best x_r? (x_min + x_max)/2.

Maximum Relative Error of a Set
- Given a set of numbers {x_1, ..., x_n}: max, min, and a sanity constant c. The optimum is some function of c, max, min; e.g., when c ≤ min ≤ max the best representative is x_r = 2·min·max/(min+max), giving error (max − min)/(max + min).
- The optimal maximum relative error for a bucket can be computed in O(1) time.

The Naïve Optimal Algorithm

for i = 1 to n do OPT_M[i, 1] = ERR_M(1, i)
for k = 2 to B do
  for i = 1 to n do
    max = −∞; min = ∞; OPT_M[i, k] = ∞
    for j = i−1 downto 1 do
      if (max < x[j+1]) max = x[j+1]
      if (min > x[j+1]) min = x[j+1]
      OPT_M[i, k] = min{ OPT_M[i, k], max( OPT_M[j, k−1], ERR_M(j+1, i) ) }

- ERR_M(j+1, i) can be obtained in O(1) time from the running min and max.
- O(Bn) space and O(Bn²) time optimal algorithm.

An Improved Optimal Algorithm
- OPT_M[i, k] = min_j { max( OPT_M[j, k−1], ERR_M(j+1, i) ) }.
- Observations: OPT_M[j, k−1] is an increasing function of j; ERR_M(j+1, i) is a decreasing function of j.
- To compute min_x { max(F(x), G(x)) }, where F is non-decreasing and G is non-increasing, binary-search for the x such that F(x) ≥ G(x) and F(x−1) < G(x−1); the minimum is min{ G(x−1), F(x) }.
- We can thus improve the innermost loop of the naïve algorithm to O(log n) time.
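The binary search just described can be sketched as follows (our own code and naming): find the crossover index where the non-decreasing F overtakes the non-increasing G, and take the smaller of the two candidate maxima around it.

```python
# Minimal sketch of min_x max(F(x), G(x)) for monotone F (non-decreasing)
# and G (non-increasing); F plays the role of OPT_M[j, k-1] and G of
# ERR_M(j+1, i) in the histogram DP.
def min_of_max(F, G, lo, hi):
    l, h = lo, hi
    while l < h:                      # first index where F(x) >= G(x)
        m = (l + h) // 2
        if F(m) >= G(m):
            h = m
        else:
            l = m + 1
    best = max(F(l), G(l))
    if l > lo:                        # the minimum may sit just before the crossover
        best = min(best, max(F(l - 1), G(l - 1)))
    return best
```

E.g., with F(j) = j and G(j) = 10 − 2j on [0, 9], the crossover is at j = 4 and the minimum of max(F, G) is 4, matching a brute-force scan.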
- However, ERR_M(j+1, i) can no longer be computed in O(1) time. Using an interval tree, we can compute the min and max values over [j+1, i], i.e., ERR_M(j+1, i), in O(log n) time.
- Thus the improved algorithm takes O(Bn log² n) time with O(Bn) space.

An Interval Tree Example
- (Figure: decomposing the query [2,4] over the tree on [1,8] via decomposeLeft/decomposeRight, using the nodes [1,1]...[8,8], [1,2], [3,4], [5,6], [7,8], [1,4], [5,8], [1,8].)

Consider Another Solution
- Make the first bucket as large as possible, i.e., push its boundary right.
- As long as the max and the min stay the same we can keep going. Why will we have to stop?

Consider Another Solution (2)
- In this example we cannot extend further, but maybe the error comes from a different bucket!
- Here's one idea: given an i, find Err[1, i]. If i is small, Err[1, i] ≤ OPT; if i is large, Err[1, i] ≥ OPT. How do we find the crossover? By binary search!
- Observe that given an error τ, it is easy to check whether that error can be realized by B buckets.

How?
- Assume that given an interval [a, b] we can find the min and max, and therefore Err[a, b]. With O(n) time and space preprocessing, we can find Err[·] in O(log n) time (interval tree).
- Check[p, q, b, τ]: if p > q (for b ≥ 0), we are done. Otherwise, find mid such that Err[p, mid] ≤ τ and Err[p, mid+1] > τ, and recurse on Check[mid+1, q, b−1, τ].
- Cost O(B log² n): each binary search is log n · log n (log n probes, each needing the min and max for Err), and Check recurses B times.

Now for the Original Problem
- By binary search, find the largest s such that, with τ = Err[1, s] and τ' = Err[1, s+1], Check[1, n, B−1, τ] = false and Check[1, n, B−1, τ'] = true.
- Now OPT = τ', or the best (B−1)-bucket error of [s+1, n]. A recursive algorithm!
- T(B) = log n · B log² n + T(B−1) ≈ O(B² log³ n)!!
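The feasibility test underlying Check can be sketched as follows. This is our own simplified version: it scans linearly instead of binary-searching over an interval tree as the tutorial's version does, and relies on the fact that a bucket with extremes lo, hi has maximum error (hi − lo)/2.

```python
# Linear-scan sketch of the "can B buckets achieve error tau?" test.
def check(x, B, tau):
    buckets, i, n = 0, 0, len(x)
    while i < n:
        lo = hi = x[i]
        j = i
        while j < n:
            lo, hi = min(lo, x[j]), max(hi, x[j])
            if (hi - lo) / 2.0 > tau:
                break                # x[j] would overflow this bucket
            j += 1
        buckets += 1                 # bucket covers x[i..j-1]
        if buckets > B:
            return False
        i = j
    return True
```

For x = [1, 2, 10, 11] and B = 2, error 0.5 is feasible (buckets {1,2} and {10,11}) but 0.4 is not.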
Summary
- In O(n + B² log³ n) time and O(n) space we can find the optimum maximum-error histogram.
- What do we do with a stream, or with less than O(n) space? Approximate, using some of the old ideas...

Short Break! When we return:
- Range-query histograms
- Wavelets: optimum synopses, and the connection to histograms
- Overall ideas and themes

Range Query Histograms

One More Synopsis Structure
- Instead of estimating the value at a point, we are interested in the sum of the values over intervals/ranges.
- Clearly very useful; clearly we need a new optimization. (E.g., a histogram that is good for point queries may be useless for range queries, as in the figure.)

A More Difficult Problem
- Only special cases are solved (satisfactorily):
  - Hierarchies: prefix ranges (all ranges of the form [1, j] as j varies), complete binary ranges, general hierarchies.
  - Uniform ranges: all ranges.

Status
- (Table of known results for range-query histograms.) Caveat: against a restricted OPT which stores the average of the values in a bucket.

The Uniform Case
- Consider a sequence X = {0, x_1, x_2, ..., x_n}.
- Define the prefix-sum operator: Δ(g)[i] = Σ_{j≤i} g[j].

Unbiased
- Suppose H is a histogram such that F = Δ(X−H) satisfies Σ_i F[i] = 0; or think of it as Σ_i Σ_{r<i} (X[r] − H[r]) = 0.
- Claim: the error of using H to answer range queries for X is twice the error of using Δ(H) to answer point queries about Δ(X)!
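The prefix-sum operator Δ can be sketched in a few lines (our own code): once Δ(X) is available, a range sum over [a, b] is a difference of two prefix values, which is why a synopsis that is good for point queries on Δ(X) is good for range queries on X.

```python
# Minimal sketch of the Delta operator and range queries via prefix sums.
def delta(g):
    """Delta(g)[i] = g[0] + ... + g[i]."""
    out, run = [], 0.0
    for v in g:
        run += v
        out.append(run)
    return out

def range_sum(pref, a, b):
    """Sum of the original values over [a, b], 0-indexed inclusive."""
    return pref[b] - (pref[a - 1] if a > 0 else 0.0)
```

E.g., delta([1, 2, 3, 4]) is [1, 3, 6, 10], and the range sum over [1, 2] is pref[2] − pref[0] = 5.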
The Main Idea
- Define G[i] = Σ_{r<i} (X[r] − H[r]) = Δ(X)[i] − Δ(H)[i]. Now Σ_i G[i] = 0 if H is unbiased.
- Pick a random element u: E[G[u]] = 0.
- Pick two random elements u, v: E[(G[u] − G[v])²] = the expected error of using H to answer range queries for X. But that equals 2 · E[G[u]²].

A Simple Approximation
- What we want is Δ(H): hard. But we know how to approximate Δ(X): piecewise-linear histograms!

An Easy Trick
- We can also add a "buffer" of size 1 after each bucket and use it as a patch-up: 2B buckets, same error as OPT.
- Approximation algorithms try to find the "continuous variant".

The Synopsis Construction Problem
- Formally, given a signal X and a dictionary {φ_i}, find a representation F = Σ_i z_i φ_i with at most B non-zero z_i, minimizing some error that is a function of X − F.
- In the case of histograms the "dictionary" was the set of all possible intervals, but we could only choose a non-overlapping set.

The Eternal "What If"
- If the {φ_i} are "designed for the data", do we get a better synopsis? Absolutely! Consider a sine wave, or any smooth function. Why, though?

Representations That Are Not Piecewise Constant
- Electromagnetic signals are sine/cosine waves. If we are considering any process involving electromagnetic signals, this is a great idea.
- These are particularly good for representing periodic functions. Often these algorithms are found in DSP (digital signal processing) chips.
- A fascinating 300+ years of history in mathematics!
A Slight Problem...
- Fourier is suited to smooth "natural" processes. If we are talking about signals from man-made processes, they are hardly likely to be smooth...
- More seriously: discreteness and burstiness.

The Wavelet (Frames)
- Inherits properties from both worlds: the Fourier transform has all frequencies; wavelets consider frequencies that are powers of 2, but the effect of each wave is limited in extent (shifted).

Wavelets
- What to do in a discrete world? The Haar wavelets (1910)!

The Haar Wavelets
- Best "energy" synopsis amongst all wavelets (we will see more later).
- Great for data with discontinuities; a natural extension to discrete spaces.
- Basis vectors (unnormalized): {1,−1,0,0,0,0,...}, {0,0,1,−1,0,0,...}, {0,0,0,0,1,−1,...}, ...; {1,1,−1,−1,0,0,0,0,...}, {0,0,0,0,1,1,−1,−1,...}, ...

The Haar Synopsis Problem
- Formally, given a signal X and the Haar basis {φ_i}, find F = Σ_i z_i φ_i with at most B non-zero z_i, minimizing some error that is a function of X − F.
- Let's begin with the VOPT error, ||X−F||₂².

The Magic of Parseval (No Spears)
- The l2 distance is unchanged by a rotation. A set of basis vectors {φ_i} defines a rotation iff ⟨φ_i, φ_j⟩ = δ_ij, so redefine (scale) the basis s.t. ||φ_i||₂ = 1.
- Let the transform be W. Then ||X−F||₂ = ||W(X−F)||₂ = ||W(X) − W(F)||₂.
- Now W(F) = {z_1, z_2, ..., z_n}, and so ||W(X) − W(F)||₂² = Σ_i (W(X)_i − z_i)².

What Did We Achieve?
- Storing the largest B coefficients is the best solution.
- Note that the fact z_i = W(X)_i is a consequence of the optimization and IS NOT a specification of the problem. More on that later.
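The normalized Haar transform behind Parseval's argument can be sketched as follows (our own code): repeatedly replace pairs (a, b) with (a+b)/√2 and (a−b)/√2 and recurse on the averages, so that the sum of squares is preserved and keeping the largest-magnitude B coefficients is l2-optimal. n must be a power of two.

```python
# Minimal sketch of the normalized Haar transform.
import math

def haar(x):
    coeffs = []
    level = list(x)
    while len(level) > 1:
        avgs, diffs = [], []
        for a, b in zip(level[::2], level[1::2]):
            avgs.append((a + b) / math.sqrt(2))   # scaled pairwise average
            diffs.append((a - b) / math.sqrt(2))  # scaled pairwise difference
        coeffs = diffs + coeffs  # prepend, so coarser detail levels end up first
        level = avgs             # recurse on the averages
    return level + coeffs        # overall (scaled) average first
```

On {1, 4, 5, 6} this gives [8, −3, −3/√2, −1/√2]; the sum of squares, 78, equals that of the input (Parseval).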
What Is the Best Algorithm?
- How to find the largest B coefficients of the set {x_1, x_2, ...}? The cascade algorithm: recall the hierarchical nature.

The Cascade Algorithm
- Given a, b, represent them as (a−b) and (a+b); divide by √2 so that the sum of squares is preserved, and recurse on the averages.
- Running time O(n). (Example on {1, 4, 5, 6}.)

Surfing Streams
- Notice that once the left half is done, we only need to remember its average: a streaming algorithm is natural.
- Have an auxiliary structure that maintains the top B of a set of numbers.
- Where else have you seen this? The reduce-merge paradigm, also used in clustering data streams.

In Summary
- Given a series {x_1, x_2, ..., x_i, ..., x_n} in increasing order of i, we can find (maintain) the largest B coefficients in O(n) time and O(B + log n) space.
- OK, but only for ||X−F||₂.

Extended Wavelets
- What do we do in the presence of multiple dimensions/measures? Use multi-dimensional transforms, or many 1-D transforms; the indices are large, and there are correlations.
- Strategy: use a flexible scheme that allows us to store the index once, plus a bitmap indicating which measures are stored.

How to Solve It?
- For the basic 1-D problem we need to choose the largest B coefficients; use Parseval to transform the error on the data into choosing/not choosing coefficients.
- Here we have "bags": we can choose coefficient j with bitmap 0100 using H+S space, 0101 using H+2S space, 1111 using H+4S space.

Is 0101 Better Than 1100?
- Subproblem: given that we have settled on choosing 2 coefficients for index j, which 2? It is the largest 2 again!
- Basically, we can choose a set of indices j and decide how many coefficients we keep for each j. What does this remind you of?

Knapsack
- Each item j is available in M different "versions". The cost of the r-th version is H + rS; the profit is an increasing function of r; we can choose only one version.

Strange Roadbumps
- Optimal profit + optimal error = total energy, but the relationship does not hold under approximation: 99 + 1 = 100, and approximating 99 by 95 increases the error by 400%. We will return to this.

Many Questions
- What do we do for other error measures? What is the connection with histograms?
- Positives: some direction, the cascade algorithm, the hierarchy of coefficients.

Non-l2 Errors

Storing Coefficients Is Suboptimal
- Recall the example {1, 4, 5, 6}. We want a 1-term summary under the maximum (l∞) error. What do we store?
- The best final result is {3.5, 3.5, 3.5, 3.5}, i.e., the transform {7, 0, 0, 0}; but the set of coefficients of the data is {8, ?, ?, ?}.

What To Do?
- Search where there is light: the restricted problem (keep only true coefficient values). Useful if the synopsis has more than one use.
- Think outside the coefficients: probabilistic rounding.
- Search (cleverly) over the whole space.

The Best Restricted Synopsis
- Maximum error. A value (at a leaf) is affected only by its ancestors; # of ancestors = log n.
- Guess/try all of the ancestor set! O(n) choices.
- Start bottom-up and use a DP to choose the best B coefficients overall. Works for a large number of error measures.
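The bitmap/bag choice above is a knapsack with per-item versions. The following is our own sketch of that DP, with the H + rS cost profile generalized to explicit (cost, profit) version lists; it picks at most one version per item to maximize profit within a space budget.

```python
# Minimal sketch of knapsack with M versions per item (choose at most one).
def knapsack_versions(items, budget):
    """items[j] = list of (cost, profit) versions; returns the best profit."""
    best = [0.0] * (budget + 1)          # best[s]: max profit within space s
    for versions in items:
        new = best[:]                    # copy, so versions of one item don't stack
        for cost, profit in versions:
            for s in range(cost, budget + 1):
                cand = best[s - cost] + profit
                if cand > new[s]:
                    new[s] = cand
        best = new
    return best[budget]
```

E.g., with items [[(1, 3), (2, 5)], [(1, 4)]] and budget 2, the best choice is the cheap version of each item, for profit 7.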
Analysis
- At each internal node j we need to maintain the table Error[j, ancestor set, b]: the contribution to the minimum error from the subtree rooted at j when using b or fewer coefficients (in the subtree).
- Size of the table O(n²B); time ≈ O(n²B log B) [depends on the measure]. But we can do better.

Faster Restricted Synopsis
- A better cut: the number of coefficients in a subtree is at most its size + 1.
- The size of the table Err[j, ancestor set, b] remains constant as we go up the levels: the ancestor set decreases by 1 while b takes twice as many values.
- An O(n²) algorithm; we can also reduce the space to O(n).

Thinking Beyond the Coefficients: Probabilistic Rounding
- Start from the coefficients. Randomly round most of them to 0; a few are rounded to non-zero values. E.g., set z_i = λ with probability (roughly) W(X)_i/λ, and 0 otherwise.
- Has promise (correct expectation, bounded variance). Two issues:
  - The quality is unclear (w.r.t. the original optimization).
  - The expected number of non-zero coefficients is B, but the variance is large, so with reasonable probability we get ~2B.

More Exploration Required
- Interestingly, the method (as proposed) eliminates a region of the search space, and we can construct examples where the optimum lies in that region.
- But it is an interesting method, and likely (we are guessing) preserves more than one error measure simultaneously (multi-criterion optimization).

What Is the Optimum Strategy?
- Consider the best set of coefficients Z* = {z_1, z_2, ..., z_n} and "nudge" them a bit by making them multiples of some d.
- The "extra error" is small (and a function of d): in fact each point sees at most ±d log n. By reducing d we can get a (1+ε) approximation.

A Straightforward Idea
- But we still need to find the solution. The ancestor set is unimportant; what matters is its combined effect.
- Try all possible values (multiples of d, but we still need to fix the range).

Experiments
- (Figures: the data sets; the l∞ and l1 errors; relative error for small B and relative l1; the running times.)

What Have We Seen So Far?
- Wavelet representation under the l2 error, including streaming.
- Wavelet representation for non-l2 errors: restricted, unrestricted, streaming.

A Return to Histograms

Easy Relationships
- A B-bucket (piecewise-constant) histogram can be represented by 2B log n Haar wavelet coefficients. Why? Only the 2B boundary points matter.
- A B-term Haar wavelet synopsis can be represented by a 3B-bucket histogram. Why? Each wavelet basis vector creates at most 3 extra pieces from 1 line.

Anything Else?
- Totally! We can use wavelets to get (1+ε)-approximate V-optimal histograms. In fact, the method has advantages...

Histograms, Take 5
- A B-term histogram can be represented by cB log n wavelet terms.
What if we choose the largest cB log n wavelet terms?

128 Need not be good
The best histogram has its cB log n wavelets "aligned" so that the result is B buckets. The best cB log n coefficients are all over the place and give us 3cB log n buckets. All hope is lost?

129 If at first you don't succeed…
Do we repeat the process and also keep the next cB log n coefficients…? No. But notice that the "energy" drops. Energy = ||X||2 = ||W(X)||2.
Basic intuition: if there were a lot of large coefficients, then the best V-Opt histogram MUST have a large error. Why?

130 The "robust" property
Look at ||W(X) - W(H)||2 = ||X - H||2.
W(H) has cB log n entries.
If W(X) has cBε^-2 log n large entries…

131 A strange idea in 1000 words
Consider the projection of X onto the largest cBε^-2 log n wavelet terms. Is X ≈ this projection?

132 No. But flatten the function, and X ≈ the flattened version. (figure)

133 In fact
If we choose (B log n)^O(1), i.e., a large number of coefficients, then the boundary points of those coefficients are (approximately) good boundary points for a V-Opt histogram.

134 The take away: I'm ok, you're ok
If I'm not ok then you're not ok either. An oft-repeated approximation paradigm:
"If there are too many coefficients then my algorithm is doomed, but so is everyone else's, and therefore I am good."
"If there are not too many coefficients then we're good."

135 The Extended Wavelets in l2
We can store the largest coefficients. If there are too many large coefficients, then the optimum error is large.
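The bucket-counting argument above (a kept Haar basis vector contributes at most 3 breakpoints, so a B-term synopsis is piecewise constant with roughly 3B pieces) can be checked numerically. A small sketch (ours) that builds the explicit orthonormal Haar basis, keeps the top B coefficients, and counts constant pieces of the reconstruction:

```python
import math

def haar_basis(n):
    """Explicit orthonormal Haar basis for length n = 2^k signals."""
    basis = [[1.0 / math.sqrt(n)] * n]       # overall-average vector
    size = n
    while size > 1:
        half, norm = size // 2, 1.0 / math.sqrt(size)
        for start in range(0, n, size):
            v = [0.0] * n
            for j in range(start, start + half):
                v[j] = norm
            for j in range(start + half, start + size):
                v[j] = -norm
            basis.append(v)
        size = half
    return basis

def top_b_reconstruction(x, B):
    """Reconstruct x from its B largest-magnitude Haar coefficients."""
    basis = haar_basis(len(x))
    coeffs = [sum(v * b for v, b in zip(x, vec)) for vec in basis]
    keep = sorted(range(len(x)), key=lambda i: abs(coeffs[i]), reverse=True)[:B]
    rec = [0.0] * len(x)
    for i in keep:
        for j in range(len(x)):
            rec[j] += coeffs[i] * basis[i][j]
    return rec

def num_pieces(y):
    """Number of maximal constant runs, i.e. histogram buckets."""
    return 1 + sum(1 for a, b in zip(y, y[1:]) if abs(a - b) > 1e-9)
```

Since each of the B kept vectors adds at most 3 breakpoints, num_pieces of the reconstruction is at most 3B + 1 for every input.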
Otherwise we repeatedly take out coefficients until taking out more no longer reduces the error; DP on the set of coefficients taken out.

136 The Full Monty: update streams
So far we have been looking at X arriving as {x1, x2, …}. What happens when X is specified by a stream of updates, i.e., (i, di): change xi to xi + di?

137 Sketches: Stream Embeddings
Basically dimensionality reduction. To compute the histogram H of signal X: compute an embedding g(X) that fits in the available space, then compute H such that g(H) is close to g(X).

138 Linear Embeddings [JL Lemma]
(1-ε)||x||2 ≤ ||Ax||2 ≤ (1+ε)||x||2, where A is a random O(ε^-2 log n) × n matrix drawn from a Gaussian distribution.
Too many elements in the matrix! Use pseudorandom generators.
p-stable distributions for lp, where p ∈ (0, 2].

139 What it achieves
Computes the norm. Increasing coordinate i adds column i of A to the sketch Ax.

140 Suppose we knew the intervals
The best histogram minimizes ||X - H||2 ≈ ||AX - AH||2.
AX is a vector, and AH is a linear function of the B bucket values, so we have a min-squared-error program, solvable in polynomial time (more involved for the 1-norm).

141 Cannot do that
||X - H||2 = ||W(X) - W(H)||2 ≈ ||AW(X) - AW(H)||2.
Idea: use the linear map to find the large wavelet coefficients (a top-k problem using sketches), then use ideas similar to Take 5 to get the final solution.

142 The return of the pink Fourier
Assuming x1, x2, …, xi, … arrive in increasing order of i, find/maintain the top k Fourier coefficients.
Use the strategy: assume that there are O(k log n) large frequencies and try to find them. If not, we are doomed and so is everyone. So we are ok.
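The linear-sketch idea above can be sketched in a few lines (ours, illustrative): maintain s = Ax for a random Gaussian A under updates (i, di), and read an l2-norm estimate off the sketch. A real streaming algorithm would generate A's entries on demand from a pseudorandom generator rather than storing the matrix; it is explicit here only for clarity.

```python
import math
import random

def make_sketch_matrix(n, m, seed=0):
    """m x n matrix of i.i.d. N(0, 1/m) entries (a JL-style embedding)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / math.sqrt(m) for _ in range(n)]
            for _ in range(m)]

class LinearSketch:
    """Maintains s = Ax under a stream of updates (i, delta)."""
    def __init__(self, n, m, seed=0):
        self.A = make_sketch_matrix(n, m, seed)
        self.s = [0.0] * m

    def update(self, i, delta):
        # x_i += delta translates to adding delta times column i of A.
        for r in range(len(self.s)):
            self.s[r] += delta * self.A[r][i]

    def norm_estimate(self):
        # E[||Ax||^2] = ||x||^2; concentrated for m = O(eps^-2 log n).
        return math.sqrt(sum(v * v for v in self.s))
```

Linearity is the whole point: the sketch never needs x itself, only the update stream, and sketches of two signals can be subtracted to estimate ||X - H||2.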
For the 3rd time…

143 What about top k
Assuming x1, x2, …, xi, … are specified by a stream of updates, find/maintain the top k values (all elements with ~1/k of the total frequency or more).
Use the strategy: assume that there are O(k log n) heavy elements and try to find them. If not, we are doomed and so is everyone. So we are ok. Again!
Use group testing: 20 questions, bit chasing. Is a heavy item in the first half? You can use norms, or you can use collisions (hashes).

144 From optimization to learning
We are trying to "learn" a "pure" signal that has few coefficients… a general paradigm.

145 The Meaning of Life
In summary (high level):
Approximation is very useful for synopsis construction (the execution-time speedups, plus "the end use of a synopsis is approximation anyway").
Synopses are usually applied to large data, so asymptotic behaviour matters.
The exact definition of the optimization is important. How natural is natural…
Few degrees of separation between the synopsis structures. They are related; they should be. But then we can use algorithmic techniques back and forth between them.

146 The Summary (contd.)
In algorithm-design terms:
Most synopsis construction problems involve DP. Investigating how to change the DP to get approximate or space-efficient algorithms is often useful.
Search techniques (computational geometry), such as searching over exponents first, are useful.
What you analyze (carefully) is often what you get asymptotically. The usual pruning techniques can be analyzed and shown to be better.
Reduce-Merge ⇒ Streaming?
The top k in various disguises; group testing matters.

147 What lies ahead
Ok. So 1-D histograms have good algorithms. 2-D? NP-hard; some approximation algorithms known.
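The "bit chasing" group-testing strategy above can be sketched as follows (ours, illustrative): keep one counter per dyadic interval per level, then descend only into halves whose total weight passes the threshold. To keep the logic visible the counters here are exact; a streaming algorithm would replace each level's counters with a small sketch (e.g. Count-Min), making the space polylogarithmic.

```python
class DyadicCounters:
    """Exact sums over dyadic intervals; levels[l] has 2^l counters,
    each covering n / 2^l consecutive positions."""
    def __init__(self, n):
        self.n = n                        # n must be a power of two
        self.levels = []
        size = n
        while size >= 1:
            self.levels.append([0.0] * (n // size))
            size //= 2

    def update(self, i, delta):
        # (i, delta): x_i += delta; touches one interval per level.
        for l, counters in enumerate(self.levels):
            counters[i // (self.n >> l)] += delta

    def heavy_hitters(self, threshold):
        """Bit chasing: descend only into dyadic intervals whose total
        weight reaches the threshold ('is a heavy item in this half?')."""
        frontier = [0]
        for l, counters in enumerate(self.levels):
            frontier = [c for c in frontier if counters[c] >= threshold]
            if l < len(self.levels) - 1:
                frontier = [2 * c + off for c in frontier for off in (0, 1)]
        return frontier
```

Only O(log n) levels are touched per update, and the descent examines at most ~2k intervals per level when k items exceed the threshold.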
Q: In linear time and sublinear space, what can we do? Sketch-based results; a long way to go.

148 What lies ahead
So 1-D Haar wavelets have good algorithms (non-l2). 2-D? Unlikely to be NP-hard; quasi-polynomial-time (n^log n) approximation algorithms known.
Q: In linear time and sublinear space, what can we do?

149 What lies ahead
So 1-D Haar wavelets have good algorithms (non-l2). Non-Haar? Daubechies, multifractals. Unlikely to be NP-hard; quasi-polynomial-time (n^log n) approximation algorithms known. What can we do?

150 What lies ahead
All the update-stream results are based on l2 error because of Johnson-Lindenstrauss (and some on lp for 0 < p ≤ 2). What about other errors? They will require new techniques for streaming.

151 Notes (not from the underground)
The VOPT definition: Poosala, Haas, Ioannidis, Shekita, SIGMOD '96.
The VOPT histogram algorithm (Takes 1-5) and Relative Error Histograms:
Jagadish, Koudas, Muthukrishnan, Poosala, Sevcik, Suel, VLDB '98.
Guha, Koudas, Shim, STOC '01.
Guha, Koudas, ICDE '02.
Guha, Koudas, Shim, TODS '05.
Guha, Indyk, Muthukrishnan, Strauss, ICALP '02.
Guha, Shim, Woo, VLDB '04.
Maximum Error Histograms: Nicole, J. of Parallel and Distributed Computing, 1994; (Muthukrishnan, Khanna, Skiena, ICALP '97); Guha, Shim, (here) '05.

152 More Notes
Range Query Histograms: Muthukrishnan, Strauss, SODA '03.
The Full Monty: Gilbert, Guha, Indyk, Kotidis, Muthukrishnan, Strauss, STOC '02.
Parseval stuff: Parseval (margin of notebook?), 1799.
The mandala.
Gilbert, Kotidis, Muthukrishnan, Strauss, VLDB '01.
Gibbons, Garofalakis, SIGMOD '02 (also TODS '04).
Garofalakis, Kumar, PODS '04.
Folklore: sum of squares and l2. Surfing Wavelets. Probabilistic Synopsis. Maximum error (restricted version).

153 Notes again
Faster Restricted Synopsis: Guha, VLDB '05.
Unrestricted non-l2 error: Guha, Harb, KDD '05 + new results.
Extended Wavelets: Deligiannakis, Roussopoulos, SIGMOD '03; Guha, Kim, Shim, VLDB '04.
Streaming Fourier approximation: Gilbert, Guha, Indyk, Muthukrishnan, Strauss, STOC '02.
Learning Fourier Coefficients: Linial, Kushilevitz, Mansour, JACM '93.
JL Lemma: Johnson, Lindenstrauss, '84.
Sketches: Alon, Matias, Szegedy, JCSS '99; Feigenbaum, Kannan, Vishwanathan, Strauss, FOCS '99; Indyk, FOCS '00.

154 Roads not taken (but relevant to synopses)
Property testing. Weighted sampling and SVD. Median finding. Sampling-based estimators.