Offline, Stream and Approximation Algorithms for Synopsis Construction
Sudipto Guha, University of Pennsylvania
Kyuseok Shim, Seoul National University
About this Tutorial

Information is incomplete and could be inaccurate.
Our presentation reflects our understanding, which may be erroneous.
A tutorial on synopsis construction algorithms, VLDB 2005
Synopses Construction
Where is the life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?
T. S. Eliot, from The Rock.




Routers
Sensors
Web
Astronomy and sciences
Too much data, too little time.
The idea


To see the world in a grain of sand…
Broad characteristics of the data:
Compression
Dimensionality reduction
Approximate query answering
Denoising, outlier detection and a broad array of signal processing
What is a synopsis ?





Hmm.
Any "shorthand" representation
Clustering!
SVD!
In this tutorial we will focus on signal/time series processing
The basic problem


Formally, given a signal X and a dictionary {φ_i}, find a representation F = Σ_i z_i φ_i with at most B non-zero z_i, minimizing some error that is a function of X - F.
Note: the above extends to any dimension.
Many issues




What is the dictionary?
Which B terms?
What is the error?
What are the constraints?
Many issues

What is the dictionary?
Set of vectors
Maybe a basis
Top K
Which B terms?
What is the error?
What are the constraints?
Many issues

What is the dictionary?
Set of vectors
Maybe a basis
Haar wavelets
Also Fourier, polynomials, …
Which B terms?
What is the error?
What are the constraints?
Many issues

What is the dictionary?
Set of vectors
May not be a basis
Histograms: there are n choose 2 vectors, but since we impose a non-overlapping restriction we get a unique representation.
Which B terms?
What is the error?
What are the constraints?
Many issues


What is the dictionary?
Which B terms?
First B?
Best B?
Why should we choose first B?
1. B vs 2B numbers
2. Also …
What is the error?
What are the constraints?
Approximation theory






Discipline of mathematics associated with the approximation of functions.
Same as our problem.
Linear theory (Parseval, ~1800; over two centuries old)
Non-linear theory (Schmidt 1909, Haar 1910)
Is it relevant? Yes. However, the mathematical treatment has been "extremal", i.e., how does the error change as a function of B, and is that bound tight?
Note: a yes answer does not say anything about "given this signal, is that the best we can do?"
Many issues



What is the dictionary?
Which B terms?
What is the error?
This controls which B.
||X - F||₂ is most common, used all over in mathematics
||X - F||₁ and ||X - F||∞ are also useful
Weights; relative error of approximation:
approximating 1000 by 1010 is not so bad,
approximating 1 by 11 is not too good an idea.
What are the constraints?
Many issues




What is the dictionary?
Which B terms?
What is the error?
What are the constraints?
Input? Stream, stream of updates, …
Space, time, precision and range of values (for the z_i in the expression F = Σ_i z_i φ_i)
In this tutorial




Histograms & wavelets
Will focus on optimal, approximation and streaming algorithms
How to get one from the other!
Connections to top-K and Fourier.
I. Histograms.
VOpt Histograms






Let's start simple.
Given a signal X, find a piecewise constant representation H with at most B pieces minimizing ||X - H||₂
[Jagadish, Koudas, Muthukrishnan, Poosala, Sevcik, Suel, 1998]
Consider one bucket: the mean is the best value.
A natural dynamic programming formulation
An Example Histogram

Data distribution:

Location (i):   1   2   3   4   5   6   7
Value (x_i):   12  10   2   8  14  28  16

V-optimal histogram (B = 4):

Range:          [1,4]  [5,5]  [6,6]  [7,7]
Representative:   8     14     28     16
Idea: VOpt Algorithm

Within a step/bucket, the mean is the best.
Assume that the last bucket is [j+1, n]. What can we say about the remaining k-1 buckets?

[figure: positions 1 … j | j+1 … n; OPT[j, k-1] covers the prefix, SQERR[j+1, n] is the last bucket]

They must also be optimal for the range [1, j] with (k-1) buckets!
Dynamic programming!!
Idea: VOpt Algorithm


A dynamic programming algorithm was given to construct the V-optimal histogram:

OPT[n, k] = min_{1 ≤ j < n} { OPT[j, k-1] + SQERR[(j+1)..n] }

OPT[j, k]: the minimum cost of representing the set of values indexed by [1..j] by a histogram with k buckets.
SQERR[(j+1)..n]: the sum of the squared absolute errors from (j+1) to n.
The DP-based VOpt Algorithm

for i = 1 to n do
  for k = 1 to B do
    for j = 1 to i-1 do   (split point between the (k-1)-bucket histogram and the last bucket)
      OPT[i, k] = min{ OPT[i, k], OPT[j, k-1] + SQERR[j+1, i] }

We need O(Bn) entries for the table OPT.
For each entry OPT[i, k], it takes O(n) time if SQERR[j+1, i] can be computed in O(1) time.
O(Bn) space and O(Bn²) time.
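The DP above can be sketched as runnable Python (a minimal offline version for exposition, not the authors' code; names are ours):

```python
def vopt_histogram(x, B):
    """O(Bn^2)-time, O(Bn)-space DP for the V-optimal histogram.

    Returns the minimum total squared error of representing x with at
    most B buckets, each bucket replaced by its mean."""
    n = len(x)
    # Prefix sums make SQERR of any interval an O(1) computation
    # (see the next slide).
    SUM = [0.0] * (n + 1)
    SQSUM = [0.0] * (n + 1)
    for i, v in enumerate(x, 1):
        SUM[i] = SUM[i - 1] + v
        SQSUM[i] = SQSUM[i - 1] + v * v

    def sqerr(i, j):
        # Squared error of bucket [i, j] (1-based, inclusive) vs. its mean.
        s = SUM[j] - SUM[i - 1]
        return (SQSUM[j] - SQSUM[i - 1]) - s * s / (j - i + 1)

    INF = float("inf")
    OPT = [[INF] * (B + 1) for _ in range(n + 1)]
    OPT[0][0] = 0.0
    for i in range(1, n + 1):
        for k in range(1, B + 1):
            for j in range(k - 1, i):  # split: last bucket is [j+1, i]
                cand = OPT[j][k - 1] + sqerr(j + 1, i)
                if cand < OPT[i][k]:
                    OPT[i][k] = cand
    return min(OPT[n][k] for k in range(1, B + 1))
```

On the earlier example X = {12, 10, 2, 8, 14, 28, 16} with B = 4, this returns 56, matching the buckets [1,4], [5,5], [6,6], [7,7].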
Computation of Sum of Squared Absolute Error in O(1) time

index   1   2   3   4
x       2   3   7   5
sum     2   5  12  17

sum(2,3) = x[2] + x[3] = sum[3] - sum[1] = 12 - 2 = 10
Computation of Sum of Squared Absolute Error in O(1) time

Let SQSUM[1, i] = Σ_{p=1}^{i} x_p² and SUM[1, i] = Σ_{p=1}^{i} x_p. Then

SQSUM(i, j) = Σ_{p=i}^{j} x_p² = SQSUM[1, j] - SQSUM[1, i-1]
SUM(i, j) = Σ_{p=i}^{j} x_p = SUM[1, j] - SUM[1, i-1]

Thus,

SQERR[i, j] = Σ_{p=i}^{j} (x_p - x̄)² = Σ_{p=i}^{j} x_p² - (1/(j-i+1)) (Σ_{p=i}^{j} x_p)²
            = (SQSUM[1, j] - SQSUM[1, i-1]) - (1/(j-i+1)) (SUM[1, j] - SUM[1, i-1])²
Analysis of VOpt Algorithm



O(n²B) time, O(nB) space
The space can be reduced (Wednesday)
Main question: the end use of a histogram is to approximate something. Why not find an "approximately optimal" (e.g., (1+ε)) histogram?
If you had to improve something?

O(n²B) time, O(nB) space
Via wavelets (ssq): O(n) time, O(B²/ε²) space
(1+ε) streaming: O(nB²/ε) time, O(B²/ε) space
O(n²B) time, O(n) space
(1+ε) streaming: O(n) time, O(B²/ε) space
Offline: O(n) time, O(B²/ε) space
(1+ε) streaming (ssq): O(n) time, O(B/ε²) space
Offline: O(n) time, O(n + B/ε) space
Take 1:
for i = 1 to n do
  for k = 1 to B do
    for j = 1 to i-1 do   (split point for the last bucket)
      OPT[i, k] = min{ OPT[i, k], OPT[j, k-1] + SQERR(j+1, i) }

As j increases:
OPT[j, k-1] is increasing
SQERR(j+1, i) is decreasing
Question: can we use the monotonicity to search for the minimum?
No


Consider a sequence of positive y₁, y₂, …, yₙ
F(i) = Σ_{j≤i} y_j and G(i) = F(n) - F(i-1)
F(i): monotonically increasing … like OPT[j, k-1]
G(i): monotonically decreasing … like SQERR(j+1, i)
Ω(n) time is necessary to find min_i { F(i) + G(i) }
Open question: does this extend to Ω(n²) over the entire algorithm?
What gives ?


Consider a sequence of positive y₁, y₂, …, yₙ
F(i) = Σ_{j≤i} y_j and G(i) = F(n) - F(i-1)
Thus F(i) + G(i) = F(n) + y_i
Any i gives a 2-approximation to min_i { F(i) + G(i) }:
F(i) + G(i) = F(n) + y_i ≤ 2 F(n)
min_i { F(i) + G(i) } is at least F(n)
Round 1




Use a histogram to approximate the function. Bootstrap!
Approximate the increasing function in powers of (1+δ)
The right endpoint is a (1+δ) approximation of the left endpoint
[figure: step function growing by factors of (1+δ)^h]
What does that do ?


Consider evaluating the function at the two endpoints.
Proof by picture:
value at the right endpoint ≥ value at any point in between ≥ value at the left endpoint. Why? By monotonicity!
right endpoint ≤ (1+δ) × left endpoint. Why? By construction.
Therefore…


The right-hand point is a (1+δ) approximation!
Holds for any point x in between:

OPT[x] + SQERR[x+1] ≥ OPT[a] + SQERR[b]
                    ≥ OPT[b]/(1+δ) + SQERR[b]
                    ≥ { OPT[b] + SQERR[b] } / (1+δ)

Are we done? Not quite yet.
What happens for B > 2? We do not compute OPT[i, b] exactly!!
Zen and the art of histograms




Approximate the increasing function in powers of (1+δ)
The right endpoint is a (1+δ) approximation
Prove by induction that the error is (1+δ)^B
This tells us what δ should be (small): in fact, if we set δ = ε/(2B) then (1+δ)^B ≤ 1+ε
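The interval idea can be sketched in code (our simplified sketch, not the authors' implementation: each row AOPT[·, k] is kept as intervals whose values agree within (1+δ), and candidates are scored with the value at the interval's left end and SQERR from its right end):

```python
def ahist_approx(x, B, eps):
    """(1+eps)-style approximation of the V-optimal DP via compressed
    rows, with delta = eps/(2B) as on the slide."""
    n = len(x)
    delta = eps / (2.0 * B)
    SUM = [0.0] * (n + 1)
    SQSUM = [0.0] * (n + 1)
    for i, v in enumerate(x, 1):
        SUM[i] = SUM[i - 1] + v
        SQSUM[i] = SQSUM[i - 1] + v * v

    def sqerr(i, j):
        if i > j:
            return 0.0
        s = SUM[j] - SUM[i - 1]
        return (SQSUM[j] - SQSUM[i - 1]) - s * s / (j - i + 1)

    INF = float("inf")
    # rows[k]: intervals [a, b, AOPT[a, k]] over split points 0..n-1.
    rows = [[[0, 0, 0.0]]] + [[] for _ in range(B)]
    aopt = [INF] * (B + 1)
    for i in range(1, n + 1):
        for k in range(1, B + 1):
            cand = min((v + sqerr(b + 1, i) for a, b, v in rows[k - 1]),
                       default=INF)
            aopt[k] = min(cand, aopt[k - 1])  # k buckets beat k-1 buckets
        for k in range(1, B + 1):
            if i < n:  # fold split point i into row k's intervals
                if rows[k] and aopt[k] <= (1 + delta) * rows[k][-1][2]:
                    rows[k][-1][1] = i       # extend the last interval
                else:
                    rows[k].append([i, i, aopt[k]])
    return aopt[B]
```

On the running example {12, 10, 2, 8, 14, 28, 16} with B = 4, the returned value stays within the (1+ε) bound of the optimum 56.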
Complexity analysis


# of intervals p ~ (B/ε) log n
Why?
c(1+δ)^(p-1) ≤ nR² and δ = ε/(2B)
R is the largest number in the data
Assume R is polynomially bounded in n
Running time ~ nB(B/ε) log n
Why are we approximating the increasing function? Why not the decreasing one?
The first streaming model



The signal X is specified by the x_i arriving in increasing order of i
Not the most general model
But extremely useful for modeling time series data
Streaming

Need to store the prefix aggregates Σ_{i≤a} x_i, Σ_{i≤a} x_i², Σ_{i≤b} x_i, Σ_{i≤b} x_i² at the interval boundaries a, b.
Required space is (B²/ε) log n
VOpt Construction: O(Bn²)

[Jagadish et al., VLDB 1998]
OPT(i, k) = min_{1≤j<i} { OPT(j, k-1) + SQERR(j+1, i) }
[figure: the n × n DP table; OPT[j, k] is computed from the row OPT[·, k-1]]
AHIST-S: (1+ε) Approximation

AOPT[i, k] = min_j { AOPT[b_jp, k-1] + SQERR[b_jp+1, i] }
O(B²ε⁻¹ n log n) time and O(B²ε⁻¹ log n) space
[figure: the row AOPT[·, k-1] is kept as P = O(Bε⁻¹ log n) intervals; within an interval with left end a, (1+δ)·AOPT[a] ≥ AOPT[b] and (1+δ)·AOPT[a] < AOPT[c] for the next point c, with δ = ε/(2B)]
The overall idea
The natural DP table
The approximate table
Do ε's talk to us?

DJIA data from 1901-1993
[figure: execution time vs. B]
Take 2: GK02





Sliding window streams
Potentially infinite data; interested in the last n only
Q: Suppose we constructed a histogram for [1..n] and now want one for [2..(n+1)]
The previous idea is dead on arrival. Consider 100, 1, 2, 3, 4, 5, 7, 8, …
Formal problem



Maintain a data structure
Given an interval [a, b], construct a B-bucket histogram for [a, b]
Compute on the fly
Generalizes the window!
Generalizes VOpt when a = 1, b = n
Reconsider the take 1

We are evaluating …
left to right, i.e., …
But we are still evaluating this guy!
A brave new world


Assume an O(n)-size buffer holds the x_i values
The previous algorithm was: …
Several issues:
1. Which values are necessary and sufficient?
2. We are not evaluating all values; what induction?
GK02: Enhanced (1+ε) Approximation


Lazy evaluation using binary search
O(B³ε⁻² log³ n) time and O(n) space
Pre-processing takes O(n) time (SUM and SQSUM)
[figure: AOPT[·, k-1] kept as P = O(Bε⁻¹ log n) intervals; the interval endpoint z satisfies (1+δ)·AOPT[a] ≥ AOPT[z] and (1+δ)·AOPT[a] < AOPT[z+1]]
GK02: Enhanced (1+ε) Approximation






Creates all B interval lists at once
The values of the necessary AOPT[j, k] are computed recursively to find the intervals [a_jp, b_jp], where b_jp is the largest z s.t.
(1+δ) AOPT[a_jp, k] ≥ AOPT[z, k] and
(1+δ) AOPT[a_jp, k] < AOPT[z+1, k]
Note that AOPT increases as z increases
Thus, we can use binary search to find z
O(n) space for the SUM and SQSUM arrays needs to be maintained to allow the computation of SQERR(j+1, i) in O(1) time
O(n + B³ε⁻² log³ n) time and O(n) space
Take 2 summary

O(n) space and O(n + B³ε⁻² log² n) time
Is that the best? Obviously no.
Take 3: AHIST-L-Δ

Suppose we knew Δ ≤ OPT ≤ 2Δ; then…
Instead of powers of (1+ε/B), use additive terms of εΔ/(2B); then…
Time is O(B³ε⁻² log n), with O(B/ε) intervals
To get Δ:
a 2-approximation: Δ = O(1) guesses
a binary search: O(log n)
Thus O(B³ log n × log n)
Overall O(n + B³(ε⁻² + log n) log n) time and O(n + B²/ε) space
Take 4: AHIST-B


Consider the Take 3 algorithm.
How to stream it?
On the new part …
Overall …
[figure: memory bound M]
Not done yet
[figure: refine the k-bucket solution into k-1 buckets within a factor (1+ρ)]
First find an α = O(1) approximation, then proceed back and refine
The running space-time



B · (# insertions) · (log M) · (log ℓ), where ℓ = O(Bε⁻¹ log n) is the length of a list
Space …
Who cares, and why?
Asymptotics

For fixed B and ε, we can compute a (1+ε) piecewise constant representation in
O(n log log n) time and O(log n) space, or
O(n) time and O(log n log log n) space.
Extends to degree-d polynomials; space increases by O(d) and time is O(nd + d³…)
Our friendly ε: Running time
[figure: execution time vs. B]

Our friendly ε: Error
[figure: (Error - VOPT)/VOPT vs. B]
What you analyze is what you get
[figure: execution time vs. n]

Questions?
For general error measure, IF…

The error of a bucket depends only on the values in the bucket.
The overall error function is the sum of the errors in the buckets.
The data can be processed in O(T) time per item such that in O(Q) time we can find the error of a bucket, storing O(P) info.
The error (of a bucket) is a monotonic function of the interval.
The values of the maximum and the minimum nonzero error are polynomially bounded in n.
Then…


Optimum histogram in O(nT + n²(B+Q)) time and O(n(P+B)) space
(1+ε)-approximation in
O(nT + nQB²ε⁻¹ log n) time and O(PB²ε⁻¹ log n) space,
O(nT + QB³(log n + ε⁻²) log n) time and O(nP) space,
O(nT) time and O(PB²ε⁻¹ log n + (QB/T)[Bε⁻¹ log²(Bε⁻¹ log n) + log n log log n]) space
Splines and piecewise polynomials

Instead of

If we wanted

Or maybe…
The overall idea

If we want to represent {x_{a+1}, …, x_b} by p₀ + p₁(x - x_a) + p₂(x - x_a)² + …
The solution is as above…
We need O(d) times more space (than before) and need to solve the linear system. This means an increase by a factor of O(d³) in time.
Another useful example: Relative error



Issue with global measures: estimating 10 by 20 and 1000 by 1010 has the same effect
The above is OK if we are querying for "1000" a thousand times and ten times for "10" (point queries and the VOPT measure)
But consider approximating a time series. We may be interested in per-point guarantees.
Sum of Squared Relative Error for a Bucket

Relative error for a bucket (s_r, e_r, x_r):

ERRSQ(s_r, e_r) = min_{x_r} Σ_{i=s_r}^{e_r} (x_i - x_r)² / max{c², x_i²} = A x_r² - 2 B x_r + C

Since A > 0, it is minimized when x_r = B/A
The minimum value is C - B²/A
If the aggregated sums A, B and C are stored, ERRSQ(i, j) can be computed in O(1) time
The optimal histogram can be constructed in O(Bn²) time… approximation algorithms follow…
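The bucket computation can be sketched as follows (our notation: per-point weights w_i = 1/max{c², x_i²}, with A = Σ w_i, B = Σ w_i x_i, C = Σ w_i x_i²):

```python
def relsq_bucket(xs, c):
    """Sum of squared relative error for one bucket, minimized over the
    representative x_r: the error is A*x_r^2 - 2*B*x_r + C, so the best
    x_r is B/A and the minimum error is C - B*B/A."""
    A = B = C = 0.0
    for x in xs:
        w = 1.0 / max(c * c, x * x)
        A += w
        B += w * x
        C += w * x * x
    return C - B * B / A, B / A  # (minimum error, best representative)
```

For example, for the bucket {1, 3} with c = 1, the weighted optimum is x_r = 1.2 with error 0.4, rather than the unweighted mean 2.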
Maximum Error and the l1 metric
Maximum Error Histograms

A bucket (s_r, e_r, x_r) with numbers {x_1, x_2, …, x_n} s.t.
s_r: starting position
e_r: ending position
x_r: representative value

Maximum error is given by
ERR_M(s_r, e_r) = min_{x_r} max_{i ∈ [s_r, e_r]} |x_i - x_r|

Maximum relative error is defined as:
ERR_M(s_r, e_r) = min_{x_r} max_{i ∈ [s_r, e_r]} |x_i - x_r| / max{c, |x_i|}
Maximum Error of a bucket




Given numbers {x_1, x_2, …, x_n},
the maximum error is given by Err_M = min_{x_r} max_i |x_i - x_r|
What is the best x_r? (x_min + x_max)/2
Maximum Relative Error of a set

Given a set of numbers {x_1, x_2, …, x_n}:
max: the maximum of the set
min: the minimum of the set
c: a sanitary constant
The optimum is some function of c, max, min;
e.g., when c ≤ min ≤ max the error is [formula]
The optimal maximum relative error for a bucket can be computed in O(1) time
The Naïve Optimal Algorithm

for i := 1 to n do {
  OPT_M[i, 1] := ERR_M(1, i)
  for k := 2 to B do {
    max := -∞; min := ∞; OPT_M[i, k] := ∞
    for j := i-1 downto 1 do {
      if (max < x[j+1]) max := x[j+1]
      if (min > x[j+1]) min := x[j+1]
      OPT_M[i, k] := min{ OPT_M[i, k], max( OPT_M[j, k-1], ERR_M(j+1, i) ) }
    }
  }
}

ERR_M(j+1, i) can be obtained in O(1) time from the running min/max
O(Bn) space and O(Bn²) time optimal algorithm
An Improved Optimal Algorithm
OPT_M[i, k] := min_j { max( OPT_M[j, k-1], ERR_M(j+1, i) ) }
Observations:
OPT_M[j, k-1] is an increasing function of j
ERR_M(j+1, i) is a decreasing function of j
To compute min_x { max( F(x), G(x) ) } where F(x) and G(x) are non-decreasing and non-increasing functions,
we can perform binary search for the value of x such that F(x) > G(x) and F(x-1) < G(x-1)
The minimum is min{ G(x-1), F(x) }
An Improved Optimal Algorithm
OPT_M[i, k] := min_j { max( OPT_M[j, k-1], ERR_M(j+1, i) ) }
We can improve the innermost loop of the naïve algorithm to O(log n) time.
However, ERR_M(j+1, i) cannot be computed in O(1) time any more.
Using an interval tree, we can compute the min and max values for [j+1, i], i.e. ERR_M(j+1, i), in O(log n) time.
Thus, the improved algorithm takes O(Bn log² n) time with O(Bn) space.
An Interval Tree Example
[figure: an interval tree over [1,8]; the query interval [2,4] is decomposed via decomposeLeft/decomposeRight into the canonical nodes [2,2] and [3,4]]
The steps of decomposing [2,4] with an interval tree
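A compact stand-in for the interval tree (our sketch: an array-based segment tree answering range min/max, and hence ERR_M(i, j) = (max - min)/2, in O(log n)):

```python
class MinMaxTree:
    """Segment tree over x supporting range min/max queries in
    O(log n), enough to evaluate the max-error of any bucket."""

    def __init__(self, x):
        self.n = n = len(x)
        self.mn = [0.0] * (2 * n)
        self.mx = [0.0] * (2 * n)
        for i, v in enumerate(x):
            self.mn[n + i] = self.mx[n + i] = v
        for i in range(n - 1, 0, -1):
            self.mn[i] = min(self.mn[2 * i], self.mn[2 * i + 1])
            self.mx[i] = max(self.mx[2 * i], self.mx[2 * i + 1])

    def err(self, l, r):
        # Max-error of bucket [l, r] (0-based, inclusive): (max - min)/2.
        lo, hi = float("inf"), float("-inf")
        l += self.n
        r += self.n + 1
        while l < r:
            if l & 1:
                lo = min(lo, self.mn[l]); hi = max(hi, self.mx[l]); l += 1
            if r & 1:
                r -= 1
                lo = min(lo, self.mn[r]); hi = max(hi, self.mx[r])
            l >>= 1
            r >>= 1
        return (hi - lo) / 2
```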
Consider another solution





Make the first bucket as large as possible, i.e., push the boundary right
E.g., in the figure we can…
As long as the max and min stay the same…
Why will we have to stop?
Consider another solution (2)



In this example we cannot…
But maybe the error comes from a different bucket!
Here's one idea:
Given an i, find Err[1, i]
If i is small, Err[1, i] ≤ OPT
If i is large, Err[1, i] ≥ OPT
How? By binary search!
Observe that given an error τ, it is easy to check if the error can be realized by B buckets
How ?

Assume that given an interval [a, b] we can find the min and max, and therefore Err[a, b]
With O(n) time and space preprocessing, we can find Err[] in O(log n) time (interval tree)
Check[p, q, b, τ]:
If p > q (for b ≥ 0), we are done.
Otherwise,
find mid s.t. Err[p, mid] ≤ τ and Err[p, mid+1] > τ
Check[mid+1, q, b-1, τ]
O(B log² n):
binary search: log n × log n (to find min and max for Err)
invocation of Check: B times
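The feasibility check can be sketched greedily (our version: direct scans instead of the interval-tree binary search, so it runs in O(n) per call rather than O(B log² n), but it decides the same question):

```python
def feasible(x, B, tau):
    """Can x be covered by at most B buckets, each with max-error
    (half the bucket's min-max spread) at most tau?

    Greedy: extend each bucket as far right as the error allows."""
    buckets = 0
    i, n = 0, len(x)
    while i < n:
        lo = hi = x[i]
        j = i
        while j < n:
            lo2, hi2 = min(lo, x[j]), max(hi, x[j])
            if (hi2 - lo2) / 2 > tau:
                break  # x[j] would push this bucket's error past tau
            lo, hi = lo2, hi2
            j += 1
        buckets += 1
        if buckets > B:
            return False
        i = j
    return True
```

Binary searching over candidate τ values with this predicate recovers the optimum max-error, as the next slide does.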
Now for the original problem

By binary search, find the largest s such that,
when τ = Err[1, s] and τ' = Err[1, s+1],
Check[1, n, B-1, τ] = false and Check[1, n, B-1, τ'] = true
Now OPT = τ' or the best (B-1)-bucket error of [s+1, n]
A recursive algorithm!
T(B) = log n × B log² n + T(B-1) ≈ O(B² log³ n) !!
Summary


In O(n + B² log³ n) time and O(n) space we can find the optimum error.
What do we do if
we have a stream, or
less than O(n) space?
Approximate, using some of the old ideas…
Short break !
When we return
•Range Query Histograms
•Wavelets
• Optimum synopsis
• Connection to Histograms
•Overall ideas and themes
Range Query Histograms
A more general synopsis structure

Instead of estimating the value at a point, we are interested in the sum of the values in intervals/ranges.
Clearly, very useful.
Clearly, we need a new optimization; the point-wise optimum is not useful in this example. [figure]
A more difficult problem

Only special cases solved (satisfactorily):
Hierarchies:
  Prefix ranges: all ranges of the form [1, j] as j varies
  Complete binary ranges
  General hierarchies
Uniform ranges: all ranges
Status Range Query

Caveat: against a restricted OPT which stores the average of the values in a bucket.
The uniform case

Consider a sequence X = {0, x₁, x₂, …, xₙ}
Define the prefix-sum operator: (Σg)[i] = Σ_{j≤i} g[j]
Unbiased



Suppose H is a histogram such that F = X - H satisfies Σ_i F[i] = 0
Or think of Σ_i Σ_{r<i} (X[r] - H[r]) = 0
Claim: the error of using H to answer range queries for X is twice the error of using Σ(H) to answer point queries about Σ(X)!
The main idea



Define G[i] = Σ_{r<i} (X[r] - H[r]) = (ΣX)[i] - (ΣH)[i]
Now Σ_i G[i] = 0 if H is unbiased
Pick a random element u: Expected[G[u]] = 0
Pick two random elements u, v:
Expected[(G[u] - G[v])²] = the expected error of using H to answer range queries for X
But that is equal to 2 × Expected[G[u]²]
A simple approximation

What we want is Σ(H): hard.
But we know how to get a good approximation of Σ(X): piecewise linear histograms!
An easy trick

We can also find:
a "buffer" of size 1 after each bucket,
used as a patch-up;
2B buckets;
same error as OPT.
Approximation algorithms try to find the "continuous variant"
The Synopsis Construction Problem


Formally, given a signal X and a dictionary {φ_i}, find a representation F = Σ_i z_i φ_i with at most B non-zero z_i, minimizing some error that is a function of X - F
In the case of histograms the "dictionary" was the set of all possible intervals, but we could only choose a non-overlapping set.
The eternal “what if”

If the {i} are “designed for the data”
do we get a better synopsis ?

Absolutely!
Consider a Sine wave …
Or any smooth fn.

Why though ?


A tutorial on synopsis construction algorithms
VLDB 2005
90
Representations not piecewise const.





Electromagnetic signals are sine/cosine waves.
If we are considering any process involving electromagnetic signals, this is a great idea.
These are particularly great for representing periodic functions.
Often these algorithms are found in DSP (digital signal processing) chips.
A fascinating 300+ years of history in mathematics!
A slight problem …

ni nill cfm back f Ffurir

Fourier is suitable for smooth "natural processes"
If we are talking about signals from man-made processes, clearly they cannot be natural (and are hardly likely to be smooth) …
More seriously: discreteness and burstiness…
The Wavelet (frames)

Wavelets inherit properties from both worlds:
the Fourier transform has all frequencies;
wavelets consider frequencies that are powers of 2, but the effect of each wave is limited (shifted)
Wavelets

What to do in a discrete world?
The Haar wavelets (1910)!
The Haar Wavelets



Best "energy" synopsis amongst all wavelets (we will see more later)
Great for data with discontinuities.
A natural extension to discrete spaces:
{1,-1,0,0,0,0,…}, {0,0,1,-1,0,0,…}, {0,0,0,0,1,-1,…}, …
{1,1,-1,-1,0,0,0,0,…}, {0,0,0,0,1,1,-1,-1,…}, …
The Haar Synopsis Problem


Formally, given a signal X and the Haar basis {φ_i}, find a representation F = Σ_i z_i φ_i with at most B non-zero z_i, minimizing some error that is a function of X - F
Let's begin with the VOPT error (||X - F||₂²)
The Magic of Parseval (no spears)


The ℓ₂ distance is unchanged by a rotation.
A set of basis vectors {φ_i} defines a rotation iff ⟨φ_i, φ_j⟩ = δ_ij, i.e.:
redefine (scale) the basis s.t. ||φ_i||₂ = 1.
Let the transform be W.
Then ||X - F||₂ = ||W(X - F)||₂ = ||W(X) - W(F)||₂
Now W(F) = {z₁, z₂, …, zₙ} and so
||W(X) - W(F)||₂² = Σ_i (W(X)_i - z_i)²
What did we achieve ?



Storing the largest coefficients is the best solution.
Note that the fact z_i = W(X)_i is a consequence of the optimization and IS NOT a specification of the problem. More on that later.
What is the best algorithm ?



How to find the largest B coefficients of the set {x₁, x₂, …}?
The cascade algorithm.
Recall the hierarchical nature.
Cascade algorithm?

Given a, b, represent them as (a - b) and (a + b)
Divide by sqrt(2) so that the sum of squares is preserved, etc.
Running time O(n)
[figure: cascade over the example signal {1, 4, 5, 6}]
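The cascade can be sketched as follows (normalized so that Parseval holds; on the slide's example {1, 4, 5, 6} the overall-average coefficient comes out to 8, the value a later slide refers to):

```python
import math

def haar_transform(x):
    """One full cascade of the normalized Haar transform.

    len(x) must be a power of two. Pairwise (a+b)/sqrt(2) averages are
    recursed on; (a-b)/sqrt(2) details are kept. Parseval holds: the sum
    of squares of the coefficients equals that of the input."""
    level = list(x)
    coeffs = []
    while len(level) > 1:
        pairs = list(zip(level[::2], level[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs]
        averages = [(a + b) / math.sqrt(2) for a, b in pairs]
        coeffs = details + coeffs  # finer details end up last
        level = averages
    return level + coeffs
```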
Surfing Streams


Notice that once the left half is done, we only need to remember its running partial averages, one per level (O(log n) of them).
A stream algorithm is natural.
[figure: cascade over {1, 4, 5, 6}]
Surfing Streams

Have an auxiliary structure that maintains the top B of a set of numbers.
Where else have you seen this?
The reduce-merge paradigm; also used in clustering data streams.
In summary


Given a series {x₁, x₂, …, x_i, …, xₙ} in increasing order of i, we can find (maintain) the largest B coefficients in O(n) time and O(B + log n) space
OK, but only for ||X - F||₂
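The auxiliary top-B structure can be sketched with a bounded min-heap (our sketch; the heap is keyed on coefficient magnitude, giving O(log B) per arriving coefficient and O(B) space):

```python
import heapq

def top_b(coeff_stream, B):
    """Maintain the B largest-magnitude coefficients of a stream.

    A size-B min-heap keyed on |coefficient| is kept; a new coefficient
    evicts the smallest stored one only if it is strictly larger."""
    heap = []  # entries (|c|, position, c)
    for pos, c in enumerate(coeff_stream):
        if len(heap) < B:
            heapq.heappush(heap, (abs(c), pos, c))
        elif abs(c) > heap[0][0]:
            heapq.heapreplace(heap, (abs(c), pos, c))
    return {pos: c for _, pos, c in heap}
```

Feeding it the Haar coefficients of the cascade as they are produced yields the best B-term ℓ₂ synopsis in one pass.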
Extended Histograms




What do we do in the presence of multiple dimensions/measures?
Use multi-dimensional transforms, or use many 1-D transforms.
Indices are large. Correlations.
Strategy: use a flexible scheme that allows us to store the index and a bitmap indicating which measures are stored.
How to solve it ?




For the basic 1-D problem we need to choose the largest B coefficients
Use Parseval to transform the error of the data to choosing/not choosing coefficients
Here we have "bags":
we can choose coefficient j with bitmap
0100 using H + S space
0101 using H + 2S space
1111 using H + 4S space
Is 0101 better than 1100 ?
Subproblem:
Given that we have settled on choosing 2 coefficients for j, which 2? It is the largest 2 again!
Basically we can choose a set of indices j and decide how many coefficients we choose for each j
What does this remind you of?
Knapsack



Each item j is available in M different "versions".
The cost of the r-th version is H + rS. The profit is an increasing function of r.
We can choose only one version.
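The resulting multiple-choice knapsack can be sketched with a generic DP (our sketch: each item contributes a list of (cost, profit) versions, e.g. cost H + rS with the corresponding energy profit, and at most one version per item is taken):

```python
def multi_choice_knapsack(items, budget):
    """Multiple-choice knapsack DP.

    items[j] lists (cost, profit) versions of item j; pick at most one
    version per item with total cost <= budget, maximizing profit.
    Runs in O(budget * total number of versions)."""
    dp = [0.0] * (budget + 1)  # dp[c] = best profit with cost <= c
    for versions in items:
        new = dp[:]
        for cost, profit in versions:
            # Scan costs downward over the previous row so that at most
            # one version of this item is ever selected.
            for c in range(budget, cost - 1, -1):
                if dp[c - cost] + profit > new[c]:
                    new[c] = dp[c - cost] + profit
        dp = new
    return dp[budget]
```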
Strange roadbumps




Optimal profit + optimal error = total energy
The relationship does not hold under approximation:
99 + 1 = 100, and approximating 99 by 95 increases the error by 400%.
We will return to this.
Many questions



What do we do for other error measures?
What is the connection with histograms?
Positives: some direction:
the cascade algorithm
the hierarchy of coefficients
Non l2 errors
Storing coefficients is suboptimal



Recall the complicated example {1, 4, 5, 6}
We want a 1-term summary and the error is the maximum error
What do we store? What is the final result? {3.5, 3.5, 3.5, 3.5}
What is the transform that yields it? {7, 0, 0, 0}
But the set of coefficients available is {8, …}
What to do ?

Search where there is light:
the restricted problem. Useful if the synopsis has more than one use.
Think outside the coefficients:
probabilistic rounding;
search (cleverly) over the whole space.
The Best Restricted Synopsis

Maximum error.
A value (at a leaf) is affected only by its ancestors.
# of ancestors = log n
Guess/try all subsets of the ancestor set! O(n) choices
Start bottom-up and use a DP to choose the best B coefficients overall.
Works for a large number of error measures.
Analysis

At each internal node j we need to maintain the table Error[j, ancestor set, b]: the contribution to the minimum error from only the subtree rooted at j when using b or fewer coefficients (for the subtree)
Size of the table: O(n²B); time ~ O(n²B log B) [depends on the measure]
But we can do better.
Faster Restricted Synopsis

A better cut:
the number of coefficients in a subtree is at most its size + 1.
The size of the table storing Err[j, ancestor set, b] remains constant as we go up the levels!
the ancestor set decreases by 1,
b takes twice as many values.
An O(n²) algorithm; we can also reduce the space to O(n)
Thinking beyond the coefficients

Probabilistic Rounding
  Start from the coefficients.
  Randomly round most of them to 0
  A few are rounded to non-zero values
  E.g. set zi = λ with prob. W(X)i/λ and 0 otherwise
  Has promise (correct expectation, bounded variance)
Two issues:
  The quality is unclear (wrt the original optimization)
  The expected number of non-zero coefficients is B
    The variance is large, so with reasonable prob. we use ~ 2B coefficients
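A minimal sketch of one such rounding scheme (our own illustration; the rounding value λ is an assumed parameter): each coefficient is kept as ±λ with probability proportional to its magnitude, which makes the rounded value unbiased, but the nonzero count is only B in expectation.

```python
import random

def prob_round(coeffs, lam, rng):
    # keep coefficient c as +/- lam with probability |c|/lam, else round to 0:
    # unbiased (E[z_i] = c_i), but the number of nonzeros is B only in expectation
    out = []
    for c in coeffs:
        p = min(abs(c) / lam, 1.0)
        out.append((lam if c >= 0 else -lam) if rng.random() < p else 0.0)
    return out

rng = random.Random(1)
# average the rounded value over many trials: it should hover near the coefficient
mean = sum(prob_round([3.0], 6.0, rng)[0] for _ in range(40000)) / 40000
print(mean)  # close to the original coefficient 3.0
```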
More exploration required

Interestingly, the method (as proposed) eliminates a region of the search space
We can construct examples where the optimum lies in that region.
But it is an interesting method and likely (we are guessing) controls more than one error measure simultaneously (multi-criterion optimization)
What is the optimum strategy ?

Consider the best set of coefficients Z* = {z1, z2, …, zn}
“Nudge” them a bit by making them multiples of some d
The “extra error” is small (and a fn of d)
In fact each point sees at most ± d log n
By reducing d we can get a (1+ε) approximation
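A sanity check on the nudging argument (our own code, using the unnormalized Haar transform where each point is reconstructed from its log n + 1 ancestors): rounding every coefficient to a multiple of d moves each reconstructed point by at most (log n + 1)·d/2.

```python
import math
import random

def haar_unnorm(x):
    # unnormalized Haar: keep the overall average plus per-level half-differences
    x = list(x)
    coeffs = []
    while len(x) > 1:
        avg = [(x[i] + x[i+1]) / 2 for i in range(0, len(x), 2)]
        dif = [(x[i] - x[i+1]) / 2 for i in range(0, len(x), 2)]
        coeffs = dif + coeffs
        x = avg
    return x + coeffs

def inv_haar_unnorm(c):
    # each leaf value is a +/- sum of its log n + 1 ancestor coefficients
    x = c[:1]
    k = 1
    while k < len(c):
        x = [v for a, b in zip(x, c[k:2*k]) for v in (a + b, a - b)]
        k *= 2
    return x

random.seed(0)
n, d = 8, 0.25
x = [random.uniform(0, 10) for _ in range(n)]
q = [round(v / d) * d for v in haar_unnorm(x)]   # nudge to multiples of d
err = max(abs(a - b) for a, b in zip(x, inv_haar_unnorm(q)))
bound = (math.log2(n) + 1) * d / 2               # log n + 1 ancestors, each off by <= d/2
print(err, bound)
```

Shrinking d tightens the per-point bound linearly, which is the source of the (1+ε) guarantee.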
A straightforward idea

But we still need to find the solution
The ancestor set is unimportant – what is important is their combined effect.
Try all possible values (multiples of d, but we still need to fix the range)
[Experimental graphs omitted: the data sets; l1 error; relative error (small B) and relative l1; running times.]
What have we seen so far

Wavelet representation for l2 error
  Streaming
Wavelet representation for non-l2 error
  Restricted
  Unrestricted
  Stream
A return to histograms
Easy relationships

A B-bucket (piecewise constant) histogram can be represented by 2B log n Haar wavelet coefficients.
Why ?
  Only the 2B boundary points matter
A B-term Haar wavelet synopsis can be represented by a 3B-bucket histogram.
Why ?
  Each wavelet basis vector creates 3 extra pieces from 1 line
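The first direction is easy to verify numerically (our own sketch): build a B-bucket histogram signal, take its Haar transform, and count the nonzero coefficients against the 2B log n bound.

```python
import math

def haar(x):
    # orthonormal Haar transform of a length-2^k signal
    x = list(x)
    out = []
    while len(x) > 1:
        s = [(x[i] + x[i+1]) / math.sqrt(2) for i in range(0, len(x), 2)]
        d = [(x[i] - x[i+1]) / math.sqrt(2) for i in range(0, len(x), 2)]
        out = d + out
        x = s
    return x + out

n, B = 64, 4
boundaries = [0, 10, 27, 50, n]               # a B-bucket histogram (assumed example)
heights = [2.0, 7.0, 1.0, 5.0]
x = [h for lo, hi, h in zip(boundaries, boundaries[1:], heights)
     for _ in range(hi - lo)]
nz = sum(1 for c in haar(x) if abs(c) > 1e-9)
print(nz, 2 * B * math.log2(n))               # nonzeros vs the 2B log n bound
```

Only coefficients whose support straddles a bucket boundary can be nonzero, and each boundary is straddled by at most log n of them.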
Anything else ?

Totally!
We can use wavelets to get (1+ε)-approximate V-optimal histograms.
In fact the method has advantages…
Histograms, Take 5:

A B-bucket histogram can be represented by cB log n wavelet terms.
What if we choose the largest cB log n wavelet terms ?
Need not be good.

The best histogram has the cB log n wavelets “aligned” such that the result is B buckets.
The best cB log n coefficients are all over the place and give us 3cB log n buckets.
Is all hope lost ?
If at first you don’t succeed…

Do we repeat the process and also keep the next cB log n coefficients … ?
  No.
But notice that the “energy” drops.
  Energy = ||X||² = ||W(X)||²
Basic intuition: if there are a lot of coefficients which are large, then the best V-Opt histogram MUST have a large error.
Why ?
The “robust” property

Look at ||W(X)-W(H)||² = ||X-H||²
W(H) has at most cB log n nonzero entries
If W(X) has cBε⁻² log n large entries …
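The equality on this slide is Parseval: the orthonormal Haar transform is a rotation, so it preserves l2 distances. A quick numerical check (our own code):

```python
import math
import random

def haar(x):
    # orthonormal Haar transform: a rotation, so it preserves the l2 norm
    x = list(x)
    out = []
    while len(x) > 1:
        s = [(x[i] + x[i+1]) / math.sqrt(2) for i in range(0, len(x), 2)]
        d = [(x[i] - x[i+1]) / math.sqrt(2) for i in range(0, len(x), 2)]
        out = d + out
        x = s
    return x + out

random.seed(0)
X = [random.uniform(-5, 5) for _ in range(16)]
H = [random.uniform(-5, 5) for _ in range(16)]
lhs = sum((a - b) ** 2 for a, b in zip(X, H))             # ||X - H||^2
rhs = sum((a - b) ** 2 for a, b in zip(haar(X), haar(H)))  # ||W(X) - W(H)||^2
print(lhs, rhs)
```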
A strange idea in 1000 words

Consider the projection to the largest cBε⁻² log n wavelet terms
[Figure: is the original signal ≈ this projection ?]
No. But flatten the function
[Figure: X ≈ its flattened version]
In fact

If we choose a (B log n)^O(1), i.e., large, number of coefficients, then the boundary points of the coefficients are (approximately) good boundary points for a V-Opt histogram.
The take away:

I’m ok, you’re ok
If I’m not ok then you’re not ok either.
An oft-repeated approximation paradigm:
  “if there are too many coefficients then my algorithm is doomed – but so is everyone else’s, and therefore I am good”
  “if there are not too many coefficients then we’re good”.
The Extended Wavelets in l2

We can store the largest coefficients
If there are too many coefficients which are large, then the optimum error is large.
Otherwise we repeatedly take out coefficients until taking out coefficients no longer reduces the error.
DP on the set of coefficients taken out.
The Full Monty – update streams

So far we have been looking at X arriving as {x1, x2, …}
What happens when X is specified by a stream of updates ?
i.e., (i, di) = change xi to xi + di
Sketches: Stream Embeddings

Basically dimensionality reduction
To compute the histogram H of signal X:
  Compute an embedding g(X) that fits in small space
  Compute H s.t. g(H) is close to g(X)
Linear Embeddings

[JL Lemma]  (1 − ε) ||x||2 ≤ ||Ax||2 ≤ (1 + ε) ||x||2
A is a random (ε⁻² log n) × n matrix drawn from a Gaussian distribution.
Too many elements in the matrix! Use pseudorandom generators.
p-stable distributions for lp where p ∈ (0, 2]
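A minimal sketch of the Gaussian embedding (our own AMS/JL-style illustration; a real implementation would derive the matrix entries from a pseudorandom generator rather than storing them): the map is linear, so an update (i, di) just adds di times column i of A to the sketch, and the scaled sketch norm estimates ||x||2.

```python
import math
import random

def gaussian_sketch(x, k, seed=0):
    # x -> Ax for a k x n matrix of i.i.d. N(0,1) entries;
    # E[||Ax||^2 / k] = ||x||^2, and the estimate concentrates for k ~ eps^-2
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) * v for v in x) for _ in range(k)]

random.seed(2)
x = [random.uniform(-1, 1) for _ in range(50)]
est = math.sqrt(sum(s * s for s in gaussian_sketch(x, 2000)) / 2000)
true = math.sqrt(sum(v * v for v in x))
print(est / true)  # close to 1
```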
What it achieves

Computes the norm
Increasing a coordinate is adding the corresponding column of A to the sketch.
Suppose we knew the intervals

The best histogram minimizes ||X-H||2 ≈ ||AX − AH||2
AX is a vector; AH is a linear function of the B bucket values
We have a min sq. error program, solvable in ptime; more involved in the 1-norm.
Cannot do that

||X-H||2 = ||W(X) − W(H)||2 ≈ ||AW(X) − AW(H)||2
Idea:
  Use the linear map to find the large wavelet coefficients (a top-k problem using sketches)
  Use ideas similar to Take 5 to get the final solution.
The return of the pink Fourier

Assuming x1, x2, …, xi, … arrive in increasing order of i, find/maintain the top k Fourier coefficients.
Use the strategy:
  Assume that there are O(k log n) frequencies and try to find them.
  If not, we are doomed and so is everyone.
  So we are ok.
For the 3rd time …
What about top k ?

Assuming x1, x2, …, xi, … are specified by a stream of updates, find/maintain the top k values (all elements with frequency ~1/k or more).
Use the strategy:
  Assume that there are O(k log n) elements and try to find them.
  If not, we are doomed and so is everyone.
  So we are ok. Again!
Use group testing
  20 questions, bit chasing – is a heavy item in the first half ?
  You can use norms – or you can use collisions (hashes).
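A toy sketch of the bit-chasing idea (our own code, with exact dyadic counters; a real streaming algorithm replaces each level's counters with small norm or hash sketches): keep a count for every dyadic prefix of the universe and descend only into halves whose count exceeds the threshold.

```python
class DyadicCounter:
    # "20 questions" bit chasing over a universe of size 2^log_n
    def __init__(self, log_n):
        self.log_n = log_n
        self.levels = [dict() for _ in range(log_n + 1)]

    def update(self, i, d):
        # update stream: x_i <- x_i + d; touches one counter per level
        for lvl in range(self.log_n + 1):
            p = i >> (self.log_n - lvl)
            self.levels[lvl][p] = self.levels[lvl].get(p, 0) + d

    def heavy(self, phi):
        # return items with frequency > phi * total, chasing one bit per level
        total = self.levels[0].get(0, 0)
        frontier = [0]
        for lvl in range(1, self.log_n + 1):
            frontier = [c for p in frontier for c in (2 * p, 2 * p + 1)
                        if self.levels[lvl].get(c, 0) > phi * total]
        return frontier

dc = DyadicCounter(4)          # universe {0, ..., 15}
dc.update(5, 10)               # item 5 is heavy
dc.update(3, 1)
dc.update(12, 1)
print(dc.heavy(0.5))           # -> [5]
```

At each level only the prefixes that are still heavy survive, so at most ~1/phi nodes are expanded per level.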
From optimization to learning

We are trying to “learn” a “pure” signal that has few coefficients…
A general paradigm.
The Meaning of Life

In summary (high level):
  Approximation is very useful for synopsis construction (the execution-time speedups, plus “the end use of a synopsis is approximation anyway”)
  Synopses are usually applied to large data. Asymptotic behaviour matters.
  The exact definition of the optimization is important. How natural is natural…
  Few degrees of separation between the synopsis structures. They are related. They should be. But then we can use algorithmic techniques back and forth between them.
The Summary (contd.)

In algorithm-design terms:
  Most synopsis construction problems involve DP. Investigating how to change the DP to get approximation or space-efficient algorithms is often useful.
  Search techniques (computational geometry) – searching over exponents first is useful.
  What you analyze (carefully) is often what you get asymptotically. The usual techniques we use for pruning etc. can be analyzed and shown to be better.
  Reduce-Merge ⇒ Streaming ?
  The top k in various disguises. Group testing matters.
What lies ahead

Ok. So 1-D histograms have good algos.
2D ?
  NP-Hard.
  Some approximation algorithms known.
Q: In linear time and sublinear space what can we do ?
  Sketch-based results. Long way to go.
What lies ahead

So 1-D Haar wavelets have good algos (non-l2).
2D ?
  Unlikely to be NP-Hard
  Quasi-polynomial (n^log n) time approximation algorithms known.
Q: In linear time and sublinear space what can we do ?
What lies ahead

So 1-D Haar wavelets have good algos (non-l2).
Non-Haar ? Daubechies. Multifractals.
  Unlikely to be NP-Hard
  Quasi-polynomial (n^log n) time approximation algorithms known.
What can we do ?
What lies ahead

All the update-stream results are based on l2 error because of Johnson-Lindenstrauss (and some on lp for 0 < p ≤ 2)
What about other errors ?
Will require new techniques for streaming.
Notes (not from the underground)

The VOPT definition
  Poosala, Haas, Ioannidis, Shekita, SIGMOD ’96.
The VOPT histogram algorithm
  Take 1: Jagadish, Koudas, Muthukrishnan, Poosala, Sevcik, Suel, VLDB ’98.
  Take 2: Guha, Koudas, Shim, STOC ’01.
  Take 3 & 4: Guha, Koudas, ICDE ’02; Guha, Koudas, Shim, TODS ’05.
  Take 5: Guha, Indyk, Muthukrishnan, Strauss, ICALP ’02.
Relative Error Histograms
  Guha, Shim, Woo, VLDB ’04.
Maximum Error Histograms
  Nicole, J. of Parallel Distributed Computing, 1994.
  (Muthukrishnan, Khanna, Skiena, ICALP ’97.)
  Guha, Shim, (here) ’05.
More Notes

Range Query Histograms
  Muthukrishnan, Strauss, SODA ’03.
The Full Monty
  Gilbert, Guha, Indyk, Kotidis, Muthukrishnan, Strauss, STOC ’02.
Parseval stuff (folklore sum of squares and l2)
  Parseval, (margin of notebook ?), 1799.
The mandala / Surfing Wavelets
  Gilbert, Kotidis, Muthukrishnan, Strauss, VLDB ’01.
Probabilistic Synopsis
  Gibbons, Garofalakis, SIGMOD ’02 (also TODS ’04).
Maximum error (restricted version)
  Garofalakis, Kumar, PODS ’04.
Notes again

Faster Restricted Synopsis
  Guha, VLDB ’05.
Unrestricted non-l2 error
  Guha, Harb, KDD ’05 + new results.
Extended Wavelets
  Deligiannakis, Roussopoulos, SIGMOD ’03.
  Guha, Kim, Shim, VLDB ’04.
Streaming Fourier approximation
  Gilbert, Guha, Indyk, Muthukrishnan, Strauss, STOC ’02.
Learning Fourier Coefficients
  Linial, Kushilevitz, Mansour, JACM, ’93.
JL Lemma
  Johnson, Lindenstrauss, ’84.
Sketches
  Alon, Matias, Szegedy, JCSS ’99.
  Feigenbaum, Kannan, Vishwanathan, Strauss, FOCS ’99.
  Indyk, FOCS ’00.
Roads not taken
(but relevant to synopses)

Property Testing
Weighted sampling and SVD
Median Finding
Sampling-based estimators