CS240B Notes
By
Carlo Zaniolo
CSD--UCLA
________________________________________
* Notes based on a VLDB’02 tutorial by Minos
Garofalakis, Johannes Gehrke, and Rajeev Rastogi
1
Synopsis: bounded-memory history-approximation
Succinct summary of old stream tuples
Like indexes/materialized-views, but base data is unavailable
Examples
Sliding Windows
Samples
Histograms
Wavelet representation
Sketching techniques
Approximate Algorithms: e.g., median, quantiles,…
Fast and light Data Mining algorithms
2
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line quantile computation
Wavelets: Haar-wavelet histogram construction & maintenance
3
• Idea: A small random sample S of the data often represents all the data well
– For a fast approx answer, apply “modified” query to S
– Example: select agg from R where odd(R.e)
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 (n=12)
Sample S: 9 5 1 8
– If agg is avg, return the average of the odd elements in S; answer: 5
– If agg is count, return n times the average, over all elements e in S, of
• 1 if e is odd
• 0 if e is even
answer: 12*3/4 = 9
Unbiased: For expressions involving count, sum, avg: the estimator is unbiased, i.e., the expected value of the answer is the actual answer
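The example above can be reproduced with a minimal Python sketch. The function names (estimate_count, estimate_avg, is_odd) are illustrative and not part of the original notes; the sample S is taken as given rather than drawn at random.

```python
def estimate_count(sample, n, predicate):
    """Estimate how many of the n stream elements satisfy predicate.

    Scales the fraction of matching sample elements up to the stream size n.
    """
    hits = sum(1 for e in sample if predicate(e))
    return n * hits / len(sample)


def estimate_avg(sample, predicate):
    """Estimate the average of the stream elements that satisfy predicate."""
    matching = [e for e in sample if predicate(e)]
    return sum(matching) / len(matching)  # assumes at least one match in the sample


def is_odd(e):
    return e % 2 == 1


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]  # n = 12, from the example
sample = [9, 5, 1, 8]                          # sample S, from the example

count_estimate = estimate_count(sample, len(stream), is_odd)  # 12 * 3/4
avg_estimate = estimate_avg(sample, is_odd)                   # (9 + 5 + 1) / 3
exact_count = sum(1 for e in stream if is_odd(e))             # true answer on the full stream
```

On this data the count estimate is 9 while the exact count of odd elements is 8, which illustrates that an unbiased estimator is correct in expectation, not on every individual sample.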
Garofalakis, Gehrke, Rastogi, VLDB’02 #
4
Example: Actual answer is within 5 ± 1 with prob 0.9
Use Tail Inequalities to give probabilistic bounds on returned answer
Markov Inequality
Chebyshev’s Inequality
Hoeffding’s Inequality
Chernoff Bound
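As one concrete instance, Hoeffding's inequality says that for m independent samples with values in a range of width R, the sample mean deviates from its expectation by at least ε with probability at most 2·exp(−2mε²/R²). The sketch below, with illustrative function names and the assumption of independent bounded samples, turns that into a sample-size calculation.

```python
import math


def hoeffding_failure_bound(m, epsilon, value_range):
    """Upper bound on P(|sample mean - true mean| >= epsilon) for m samples.

    Assumes independent samples whose values lie in an interval of width
    value_range (an assumption of this sketch, not stated in the notes).
    """
    return min(1.0, 2 * math.exp(-2 * m * epsilon ** 2 / value_range ** 2))


def hoeffding_sample_size(epsilon, delta, value_range):
    """Smallest m for which the Hoeffding bound is at most delta."""
    m = value_range ** 2 * math.log(2 / delta) / (2 * epsilon ** 2)
    return math.ceil(m)
```

For a guarantee of the form "within ± 1 with probability 0.9" on values between 1 and 9, one would call hoeffding_sample_size(1, 0.1, 8). The bound is distribution-free, so it is often pessimistic compared with what a specific stream needs.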
5
Reservoir Sampling [Vit85]: Maintains a sample S of preassigned size M over a stream of arbitrary size
Add each new element to S with probability M/n, where n is the current number of stream elements
If an element is added, evict a random element from S
Instead of flipping a coin for each element, determine the number of elements to skip before the next one is added to S
Concise Sampling [GM98]: Duplicates in sample S are stored as <value, count> pairs (thus potentially boosting the actual sample size)
Counting Samples [GM98]: for answering hot list queries (k most frequent values)
Window Sampling [BDM02, BOZ08]: Maintains a sample S of preassigned size M over a window on a stream, i.e., reservoir sampling with expiring tuples
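The basic reservoir sampling step described above can be sketched in Python as follows. This is a minimal coin-flip-per-element version; it does not implement the skip-count optimization, and the function name and the rng parameter are illustrative choices.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Return a uniform random sample of up to m elements from stream.

    rng is an optional random.Random instance (hypothetical parameter,
    added here only to make runs repeatable).
    """
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= m:
            reservoir.append(item)              # fill phase: keep the first m elements
        elif rng.random() < m / n:              # keep the n-th element with probability m/n
            reservoir[rng.randrange(m)] = item  # evict a uniformly chosen resident
    return reservoir
```

A call such as reservoir_sample(range(1000), 10) makes a single pass and never holds more than 10 elements, which is what makes the method suitable for streams of unknown length.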
6
Given a complex query graph, how should the sampling process be used and managed? [BDM04]
More about this later [LawZ08]
7
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line quantile computation
Wavelets: Haar-wavelet histogram construction & maintenance
Sketches
8
Histograms approximate the frequency distribution of element values in a stream
A histogram (typically) consists of
A partitioning of element domain values into buckets
A count per bucket B (of the number of elements in B)
Widely used in DBMS query optimization
Many types proposed:
Equi-Depth Histograms: select buckets such that counts per bucket are equal
V-Optimal Histograms: select buckets to minimize frequency variance within buckets
Wavelet-based Histograms
9
• Equi-Depth Histograms
– Idea: Select buckets such that counts per bucket are equal
[Figure: equi-depth histogram; x-axis: domain values 1 to 20, y-axis: count for bucket]
• V-Optimal Histograms [IP95] [JKM98]
– Idea: Select buckets to minimize frequency variance within buckets:
minimize $\sum_{B} \sum_{v \in B} \left( f_v - \frac{C_B}{V_B} \right)^2$
[Figure: V-optimal histogram; x-axis: domain values 1 to 20, y-axis: count for bucket]
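The V-optimal objective can be evaluated for a candidate bucketization with a short Python sketch. It assumes the usual reading of the objective: for each bucket B, sum the squared difference between each value's frequency and the bucket's average frequency (bucket count divided by the number of domain values in the bucket). The function name and the toy frequencies are hypothetical.

```python
def v_optimal_error(freq, buckets):
    """Sum of squared deviations of value frequencies from their bucket average.

    freq:    dict mapping domain value -> frequency (missing values count as 0)
    buckets: list of lists, each holding the domain values of one bucket
    """
    total = 0.0
    for bucket in buckets:
        bucket_count = sum(freq.get(v, 0) for v in bucket)   # count for the bucket
        mean = bucket_count / len(bucket)                    # average frequency in the bucket
        total += sum((freq.get(v, 0) - mean) ** 2 for v in bucket)
    return total


# Toy frequencies over domain values 1..4 (illustrative, not from the slides).
toy_freq = {1: 4, 2: 4, 3: 1, 4: 1}
good_split = v_optimal_error(toy_freq, [[1, 2], [3, 4]])   # groups equal frequencies
poor_split = v_optimal_error(toy_freq, [[1], [2, 3, 4]])   # mixes 4 with 1, 1
```

This only scores a given partitioning; finding the best partitioning is a separate search problem that the sketch does not address.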
10
Equi-Depth Histogram Construction
For a histogram with b buckets, compute the elements with rank n/b, 2n/b, ..., (b-1)n/b
Example: (n=12, b=4)
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
After sort: 1 1 2 3 4 5 5 6 7 8 9 9
rank = 3 (.25-quantile): element 2
rank = 6 (.5-quantile): element 5
rank = 9 (.75-quantile): element 7
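The construction above can be sketched in Python by sorting and picking the elements at the required ranks. This is the simple offline version, which stores and sorts all the data; it is not a one-pass stream algorithm. The function name is illustrative.

```python
def equi_depth_boundaries(values, b):
    """Return the b-1 bucket boundaries of an equi-depth histogram.

    Picks the elements of rank n/b, 2n/b, ..., (b-1)n/b from the sorted data.
    Assumes len(values) >= b.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Ranks are 1-based, so the element of rank r sits at index r - 1.
    return [ordered[(i * n) // b - 1] for i in range(1, b)]


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]   # n = 12, from the example
boundaries = equi_depth_boundaries(stream, 4)   # elements of rank 3, 6, 9
```

On a real stream the sort would be replaced by an approximate quantile summary, since the full data cannot be kept.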
11
• (Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation
• Example: select count(*) from R where 4 <= R.e <= 15
– Count spread evenly among bucket values
[Figure: histogram over domain values 1 to 20, with the query range 4 <= R.e <= 15 marked]
– answer: 3.5 * $C_B$
• For equi-depth histograms, maximum error: 2 * $C_B$
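A range count can be estimated from a histogram by spreading each bucket's count evenly over its domain values, as sketched below in Python. The bucket boundaries and the per-bucket count of 10 are hypothetical (the buckets in the original figure are not recoverable), so the numeric result differs from the slide's answer.

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(*) for lo <= value <= hi from a histogram.

    buckets: list of (low, high, count) tuples with inclusive integer bounds.
    Each bucket contributes its count times the fraction of its domain
    values that fall inside the query range (uniform-spread assumption).
    """
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1   # number of covered domain values
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total


# Hypothetical equi-depth histogram over the domain 1..20, count 10 per bucket.
example_buckets = [(1, 4, 10), (5, 8, 10), (9, 16, 10), (17, 20, 10)]
estimate = estimate_range_count(example_buckets, 4, 15)
```

Only the buckets at the two ends of the query range can be partially covered; fully covered buckets contribute their exact count, so the estimation error comes from the end buckets.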
12
Quantiles Using Samples
Quantiles from Synopses
One pass algorithms for approximate samples …
Much work in this area … omitted
13
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line quantile computation
Wavelets: Haar-wavelet histogram construction & maintenance
Sketches
14
• Wavelets : Mathematical tool for hierarchical decomposition of functions/signals
• Haar wavelets : Simplest wavelet basis, easy to understand and implement
– Recursive pairwise averaging and differencing at different resolutions
Resolution   Averages                    Detail Coefficients
3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
2            [2, 1, 4, 4]                [0, -1, -1, 0]
1            [1.5, 4]                    [0.5, 0]
0            [2.75]                      [-1.25]
Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
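The pairwise averaging and differencing can be written as a short Python sketch. It assumes the input length is a power of two and uses the same convention as the example: each detail coefficient is half the difference (left minus right) of a pair. The function name is illustrative.

```python
def haar_decompose(data):
    """Return the Haar decomposition: overall average, then details coarse to fine.

    Assumes len(data) is a power of two.
    """
    averages = list(data)
    details = []                                  # detail coefficients, coarsest level first
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Details of this (coarser) level go in front of the finer ones.
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


coefficients = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
```

Each pass halves the number of averages, so the whole transform takes time linear in the input length.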
15
• Hierarchical decomposition structure (a.k.a. “error tree”)
[Figure: error tree with root 2.75, below it -1.25, then 0.5 and 0, then 0, -1, -1, 0, over the original frequency distribution 2 2 0 2 3 5 4 4; each edge is labeled + or -. A second panel, “Coefficient Supports”, shows the + and - range of each coefficient over the original distribution.]
16
Key idea: Use a compact subset of Haar/linear wavelet coefficients for approximating frequency distribution
Steps
Compute cumulative frequency distribution C
Compute linear wavelet transform of C
Greedy heuristic methods
Retain coefficients leading to large error reduction
Throw away coefficients that give small increase in error
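The idea of keeping only a compact subset of coefficients can be illustrated with a simplified Python sketch. Note the substitutions: it applies the plain Haar transform directly to the frequencies and keeps the k coefficients with the largest magnitude weighted by the square root of their support, rather than using linear wavelets over the cumulative distribution or the greedy heuristics listed above. All function names are illustrative, and the input length is assumed to be a power of two.

```python
import math


def haar_decompose(data):
    """Haar decomposition: overall average, then details coarse to fine."""
    averages = list(data)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose: expand each average into (avg + d, avg - d)."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        pos += len(values)
        values = [x for avg, d in zip(values, details) for x in (avg + d, avg - d)]
    return values


def keep_top_k(coeffs, k):
    """Zero out all but the k coefficients with the largest weighted magnitude."""
    n = len(coeffs)

    def weight(i):
        # Support = number of data values the coefficient influences.
        support = n if i == 0 else n >> (i.bit_length() - 1)
        return abs(coeffs[i]) * math.sqrt(support)

    keep = set(sorted(range(n), key=weight, reverse=True)[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


original = [2, 2, 0, 2, 3, 5, 4, 4]
synopsis = keep_top_k(haar_decompose(original), 4)   # 4 of 8 coefficients retained
approximation = haar_reconstruct(synopsis)
```

Weighting by support reflects that a coarse coefficient affects many values, so dropping it costs more squared error than dropping a fine coefficient of the same magnitude.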
17
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line quantile computation
Wavelets: Haar-wavelet histogram construction & maintenance
18
Conventional data summaries fall short:
Quantiles and 1-d histograms: Cannot capture attribute correlations
Samples (e.g., using Reservoir Sampling) perform poorly for joins
Multi-d histograms/wavelets: Construction requires multiple passes over the data
Different approach: Randomized sketch synopses
Only logarithmic space
Probabilistic guarantees on the quality of the approximate answer
Can handle extreme cases.
19
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line quantile computation
Wavelets: Haar-wavelet histogram construction & maintenance
Sketches
20
When the input stream rate exceeds system capacity, a stream manager can shed load (tuples)
Load shedding affects queries and their answers: drop the tasks and tuples that will cause the least loss
Introducing load shedding in a data stream manager is a challenging problem
Random load shedding or semantic load shedding
21
QoS for each application as a function relating output to its utility
– Delay based, drop based, value based
Techniques for introducing load shedding operators in a plan such that QoS is disrupted the least
– Determining when, where and how much load to shed
22
Formulate load shedding as an optimization problem for multiple sliding window aggregate queries
– Minimize inaccuracy in answers subject to output rate matching or exceeding arrival rate
Consider placement of load shedding operators in query plan
– Each operator sheds load uniformly with probability $p_i$
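A single random load shedding operator feeding a count aggregate can be sketched in Python as follows. The sketch assumes the probability is the chance that a tuple is dropped, and it compensates by scaling the count of surviving tuples by 1/(1 - p); the function name and the rng parameter are hypothetical, and the placement optimization across a query plan is not modeled.

```python
import random


def shed_and_count(stream, drop_probability, rng=None):
    """Drop each tuple independently, then return a scaled count estimate.

    drop_probability must be < 1. rng is an optional random.Random
    instance (hypothetical parameter, for repeatable runs).
    """
    rng = rng or random.Random()
    kept = sum(1 for _ in stream if rng.random() >= drop_probability)
    # Each tuple survives with probability 1 - p, so dividing by 1 - p
    # makes the expected value of the estimate equal to the true count.
    return kept / (1.0 - drop_probability)
```

With drop_probability set to 0.5 the operator processes about half the tuples downstream, at the price of a noisier count.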
23
[BDM02] B. Babcock, M. Datar, R. Motwani. “Sampling from a Moving Window over Streaming Data”. Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 633–634, 2002.
[BOZ08] V. Braverman, R. Ostrovsky, C. Zaniolo. “Succinct Sampling on Streams”. Submitted for publication.
[Vit85] J. S. Vitter. “Random Sampling with a Reservoir”. ACM TOMS, 1985.
[GM98] P. B. Gibbons and Y. Matias. “New Sampling-Based Summary Statistics for Improving Approximate Query Answers”. ACM SIGMOD 1998.
[BDM04] B. Babcock, M. Datar, R. Motwani. “Load Shedding for Aggregation Queries over Data Streams”. ICDE 2004, pp. 350–361.
[LawZ08] Y.-N. Law and C. Zaniolo. “Improving the Accuracy of Continuous Aggregates and Mining Queries on Data Streams under Load Shedding”. International Journal of Business Intelligence and Data Mining, 2008.
24