Approximation and Load Shedding for QoS in DSMS*

CS240B Notes

By

Carlo Zaniolo

CSD--UCLA

________________________________________

* Notes based on a VLDB’02 tutorial by Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi


Synopses and Approximation

• Synopsis: bounded-memory history-approximation

– Succinct summary of old stream tuples

– Like indexes/materialized-views, but base data is unavailable

• Examples

– Sliding Windows

– Samples

– Histograms

– Wavelet representation

– Sketching techniques

• Approximate Algorithms: e.g., median, quantiles, …

• Fast and light Data Mining algorithms


Overview of Stream Synopses

Windows: logical, physical (covered)

Samples: Answering queries using samples

Histograms: Equi-depth histograms, On-line quantile computation

Wavelets: Haar-wavelet histogram construction & maintenance


Sampling: Basics

• Idea: A small random sample S of the data often represents all the data well

– For a fast approximate answer, apply a “modified” query to S

– Example: select agg from R where odd(R.e)

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 (n=12)

Sample S: 9 5 1 8

– If agg is avg, return the average of the odd elements in S. Answer: (9 + 5 + 1)/3 = 5

– If agg is count, return n times the average over all elements e in S of (1 if e is odd, 0 if e is even). Answer: 12 * 3/4 = 9

Unbiased: For expressions involving count, sum, avg: the estimator is unbiased, i.e., the expected value of the answer is the actual answer
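The sampling example on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `estimate_from_sample` and its `agg` string argument are illustrative choices, not part of the original notes, and the sample is assumed to be drawn uniformly at random from the stream.

```python
def estimate_from_sample(sample, n, agg):
    """Estimate `select agg from R where odd(R.e)` from a uniform sample.

    sample: the sampled elements; n: number of elements seen in the stream.
    """
    odd = [e for e in sample if e % 2 == 1]
    if agg == "avg":
        # Average of the qualifying elements in the sample (None if there are none).
        return sum(odd) / len(odd) if odd else None
    if agg == "count":
        # Fraction of qualifying elements in the sample, scaled up to the stream size.
        return n * len(odd) / len(sample)
    raise ValueError("unsupported aggregate: " + agg)


sample = [9, 5, 1, 8]          # sample S from the slide
avg_estimate = estimate_from_sample(sample, 12, "avg")      # 5.0
count_estimate = estimate_from_sample(sample, 12, "count")  # 9.0
```

On the full stream the exact count of odd elements is 8, so the estimate of 9 illustrates that a single sample gives an approximate answer; unbiasedness is a statement about the average over all possible samples.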

Garofalakis, Gehrke, Rastogi, VLDB’02 #


Probabilistic Guarantees

• Example: Actual answer is within 5 ± 1 with prob ≥ 0.9

• Use Tail Inequalities to give probabilistic bounds on returned answer

– Markov Inequality

– Chebyshev’s Inequality

– Hoeffding’s Inequality

– Chernoff Bound
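As one illustration of how a tail inequality turns into a guarantee of the form "within ± eps with probability at least 1 - delta", the sketch below uses Hoeffding's inequality for the mean of m independent sample values that all lie in a known range [lo, hi]: P(|sample mean - true mean| >= eps) <= 2 exp(-2 m eps^2 / (hi - lo)^2). The function names and the value range [1, 9] are assumptions made for the example, not taken from the notes.

```python
import math


def hoeffding_failure_bound(m, eps, lo, hi):
    """Upper bound on P(|sample mean - true mean| >= eps) for m values in [lo, hi]."""
    return min(1.0, 2 * math.exp(-2 * m * eps ** 2 / (hi - lo) ** 2))


def hoeffding_sample_size(eps, delta, lo, hi):
    """Smallest m for which the Hoeffding bound is at most delta."""
    return math.ceil((hi - lo) ** 2 * math.log(2 / delta) / (2 * eps ** 2))


# Assumed setting: values in [1, 9], answer wanted within +/- 1 with prob >= 0.9.
m = hoeffding_sample_size(eps=1.0, delta=0.1, lo=1, hi=9)
bound = hoeffding_failure_bound(m, eps=1.0, lo=1, hi=9)
```

The bound is distribution-free, so the sample size it suggests is usually conservative; tighter bounds may apply when more is known about the data.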


Sampling—some background

• Reservoir Sampling [Vit85]: Maintains a sample S of a preassigned size M over a stream of arbitrary size

– Add each new element to S with probability M/n, where n is the current number of stream elements

– If an element is added, evict a random element from S

– Instead of flipping a coin for each element, determine the number of elements to skip before the next one is added to S

• Concise Sampling [GM98]: Duplicates in sample S are stored as <value, count> pairs (thus potentially boosting the actual sample size)

• Counting Samples [GM98]: For answering hot-list queries (k most frequent values)

• Window Sampling [BDM02, BOZ08]: Maintains a sample S of a preassigned size M over a window on a stream, i.e., reservoir sampling with expiring tuples
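The basic reservoir sampling step described above can be sketched as follows. This is the simple coin-flip version (one random draw per element), not the skip-based optimization; the function name `reservoir_sample` and the `rng` parameter are illustrative assumptions.

```python
import random


def reservoir_sample(stream, M, rng=None):
    """Keep a uniform random sample of (at most) M elements from a stream."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= M:
            # The first M elements fill the reservoir (M/n >= 1).
            sample.append(item)
        elif rng.random() < M / n:
            # Admit the n-th element with probability M/n,
            # evicting a uniformly chosen current member.
            sample[rng.randrange(M)] = item
    return sample


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
S = reservoir_sample(stream, 4, random.Random(0))
```

Only the M sampled elements and the counter n are kept in memory, which is what makes the method suitable for streams of unknown length.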


Load Shedding Using Samples

• Given a complex query graph, how should the sampling process be used and managed? [BDM04]

• More about this later [LawZ08]


Overview

Windows: logical, physical (covered)

Samples: Answering queries using samples

Histograms: Equi-depth histograms, On-line quantile computation

Wavelets: Haar-wavelet histogram construction & maintenance

Sketches


Histograms

• Histograms approximate the frequency distribution of element values in a stream

• A histogram (typically) consists of

– A partitioning of element domain values into buckets

– A count per bucket B (of the number of elements in B)

• Widely used in DBMS query optimization

Many types have been proposed:

• Equi-Depth Histograms: select buckets such that counts per bucket are equal

• V-Optimal Histograms: select buckets to minimize frequency variance within buckets

• Wavelet-based Histograms


Types of Histograms

• Equi-Depth Histograms

– Idea: Select buckets such that counts per bucket are equal

[Figure: equi-depth histogram; x-axis: domain values 1 to 20, y-axis: count for bucket]

• V-Optimal Histograms [IP95] [JKM98]

– Idea: Select buckets to minimize frequency variance within buckets

minimize $\sum_{B} \sum_{v \in B} \left( f_v - \frac{C_B}{V_B} \right)^2$

[Figure: V-optimal histogram; x-axis: domain values 1 to 20, y-axis: count for bucket]
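To make the V-Optimal objective concrete, the sketch below evaluates it for a given bucketization: for every bucket B it sums the squared difference between each value's frequency and the bucket's average frequency (the bucket count divided by the number of values in the bucket). The function name, the index-pair bucket encoding, and the example frequency vector are assumptions chosen for illustration; the code scores a partition and does not search for the optimal one.

```python
def v_optimal_cost(freqs, buckets):
    """Sum of squared deviations of frequencies from their bucket average.

    freqs: frequency of each domain value, in domain order.
    buckets: list of (start, end) index pairs, end exclusive.
    """
    total = 0.0
    for start, end in buckets:
        bucket = freqs[start:end]
        mean = sum(bucket) / len(bucket)  # bucket count / number of values in bucket
        total += sum((f - mean) ** 2 for f in bucket)
    return total


freqs = [2, 2, 0, 2, 3, 5, 4, 4]  # hypothetical frequency distribution
cost_halves = v_optimal_cost(freqs, [(0, 4), (4, 8)])  # 5.0
cost_alt = v_optimal_cost(freqs, [(0, 2), (2, 8)])     # 16.0
```

A V-Optimal histogram with two buckets would prefer the first partition here, since its cost is lower.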


Equi-Depth Histogram Construction

• For a histogram with b buckets, compute the elements with rank n/b, 2n/b, ..., (b-1)n/b

• Example: (n=12, b=4)

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1

After sort: 1 1 2 3 4 5 5 6 7 8 9 9

rank = 3 (.25-quantile): value 2

rank = 6 (.5-quantile): value 5

rank = 9 (.75-quantile): value 7
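The construction above can be sketched directly by sorting and picking the elements at ranks n/b, 2n/b, ..., (b-1)n/b. Note that this version sorts all the data, so it is an offline illustration rather than a one-pass stream algorithm; the function name `equi_depth_boundaries` is an illustrative choice.

```python
def equi_depth_boundaries(values, b):
    """Return the b-1 bucket boundaries (quantiles) of an equi-depth histogram."""
    ordered = sorted(values)
    n = len(ordered)
    if b < 1 or n < b:
        raise ValueError("need at least one element per bucket")
    # Rank k*n/b is 1-based, so subtract 1 to index the sorted list.
    return [ordered[(k * n) // b - 1] for k in range(1, b)]


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
boundaries = equi_depth_boundaries(stream, 4)  # [2, 5, 7]
```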


Answering Queries Using Histograms [IP99]

• (Implicitly) map the histogram back to an approximate relation, & apply the query to the approximate relation

• Example: select count(*) from R where 4 <= R.e <= 15

[Figure: histogram over domain values 1 to 20, with the count of each bucket spread evenly among the bucket's values; the query range 4 ≤ R.e ≤ 15 is marked]

answer: 3.5 * C_B

• For equi-depth histograms, maximum error: ± 2 * C_B
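A range count can be estimated from a histogram by assuming that each bucket's count is spread evenly over the values it covers. The sketch below does this for integer domains. The bucket boundaries are hypothetical (the original figure is not available); they were chosen only so that the query range 4 to 15 covers 3.5 buckets, and `C_B = 10` is an arbitrary per-bucket count.

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(*) where lo <= e <= hi from a histogram.

    buckets: list of (low, high, count) with inclusive integer bounds.
    """
    total = 0.0
    for b_lo, b_hi, count in buckets:
        # Number of domain values shared by the bucket and the query range.
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total


C_B = 10  # equi-depth: every bucket holds the same count (assumed value)
buckets = [(1, 1, C_B), (2, 5, C_B), (6, 8, C_B),
           (9, 11, C_B), (12, 15, C_B), (16, 20, C_B)]
estimate = estimate_range_count(buckets, 4, 15)  # 35.0, i.e. 3.5 * C_B
```

Only the two buckets that are partially covered by the query range can contribute error, which is where a bound of at most two bucket counts comes from.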


Approximate Algorithms

• Quantiles Using Samples

• Quantiles from Synopses

• One-pass algorithms for approximate samples …

• Much work in this area … omitted



One-Dimensional Haar Wavelets

• Wavelets: Mathematical tool for hierarchical decomposition of functions/signals

• Haar wavelets: Simplest wavelet basis, easy to understand and implement

– Recursive pairwise averaging and differencing at different resolutions

Resolution   Averages                    Detail Coefficients
3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
2            [2, 1, 4, 4]                [0, -1, -1, 0]
1            [1.5, 4]                    [0.5, 0]
0            [2.75]                      [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
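The recursive pairwise averaging and differencing can be sketched in a few lines. This version uses the unnormalized convention of the slide (average = (a + b)/2, detail = (a - b)/2) and assumes the input length is a power of two; the function name `haar_decompose` is illustrative.

```python
def haar_decompose(values):
    """Haar decomposition by repeated pairwise averaging and differencing."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        # Coarser-resolution details go in front of finer ones.
        details = level_details + details
    return averages + details


coeffs = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```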


Haar Wavelet Coefficients

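• Hierarchical decomposition structure (a.k.a. “error tree”)

[Figure: error tree with root 2.75, below it -1.25, then 0.5 and 0, then 0, -1, -1, 0, placed over the original frequency distribution 2 2 0 2 3 5 4 4; left edges are labeled + and right edges -]

• Coefficient “Supports”

[Figure: supports of the coefficients 2.75, -1.25, 0.5, 0, 0, -1, -1, 0, showing for each coefficient the range of data positions it contributes to, with sign + or -]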
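The error tree can be read top-down to recover the original values: at each node the average of a range is split into the averages of its left and right halves by adding and subtracting the node's detail coefficient. The sketch below performs this reconstruction level by level. It assumes the unnormalized coefficients in the order used in these notes (overall average first, then details from coarse to fine); the function name `haar_reconstruct` is illustrative.

```python
def haar_reconstruct(coeffs):
    """Invert the Haar decomposition: [avg, details coarse-to-fine] -> values."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        expanded = []
        for avg, d in zip(values, details):
            # Left child gets +detail, right child gets -detail.
            expanded.extend([avg + d, avg - d])
        pos += len(values)
        values = expanded
    return values


original = haar_reconstruct([2.75, -1.25, 0.5, 0, 0, -1, -1, 0])
# [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
```

Each reconstructed value depends only on the coefficients on its root-to-leaf path, for example the first value is 2.75 + (-1.25) + 0.5 + 0 = 2.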


Compressed Wavelet Representations

Key idea: Use a compact subset of Haar/linear wavelet coefficients for approximating frequency distribution

Steps

• Compute cumulative frequency distribution C

• Compute linear wavelet transform of C

• Greedy heuristic methods

– Retain coefficients leading to large error reduction

– Throw away coefficients that give small increase in error
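One simple way to decide which coefficients to keep is to retain the k coefficients with the largest normalized magnitude and zero out the rest. This is a swapped-in, commonly used thresholding heuristic and not necessarily the greedy method the notes refer to; it is also applied here to Haar coefficients of the raw frequencies rather than to a linear wavelet transform of the cumulative distribution. The function name and the normalization by the square root of each coefficient's support size are assumptions tied to the unnormalized averaging/differencing convention.

```python
import math


def keep_largest_coefficients(coeffs, k):
    """Zero out all but the k coefficients with largest normalized magnitude.

    coeffs: [overall average, details coarse-to-fine], unnormalized.
    """
    n = len(coeffs)

    def weight(i):
        # Coefficient i > 0 sits at level floor(log2(i)) and covers n / 2^level values.
        level = i.bit_length() - 1 if i > 0 else 0
        support = n >> level
        return abs(coeffs[i]) * math.sqrt(support)

    keep = set(sorted(range(n), key=weight, reverse=True)[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


compressed = keep_largest_coefficients([2.75, -1.25, 0.5, 0, 0, -1, -1, 0], 4)
# [2.75, -1.25, 0.0, 0.0, 0.0, -1, -1, 0.0]
```

In this example the coefficient 0.5 is dropped even though it is not the smallest in raw value, because the two -1 coefficients have a larger normalized weight.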



Sketches

• Conventional data summaries fall short:

– Quantiles and 1-d histograms: Cannot capture attribute correlations

– Samples (e.g., using Reservoir Sampling) perform poorly for joins

– Multi-d histograms/wavelets: Construction requires multiple passes over the data

• Different approach: Randomized sketch synopses

– Only logarithmic space

– Probabilistic guarantees on the quality of the approximate answer

– Can handle extreme cases


Overview

Windows: logical, physical (covered)

Samples: Answering queries using samples

Histograms: Equi-depth histograms, On-line quantile computation

Wavelets: Haar-wavelet histogram construction & maintenance

Sketches

QoS by load shedding.


QoS and Load Shedding

• When the input stream rate exceeds system capacity, a stream manager can shed load (tuples)

• Load shedding affects queries and their answers: drop the tasks and the tuples that will cause the least loss

• Introducing load shedding in a data stream manager is a challenging problem

• Random load shedding or semantic load shedding
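The difference between random and semantic load shedding can be illustrated with a small batch-oriented sketch. Everything named here is hypothetical: `drop_fraction` simply computes the share of tuples that must go to bring the input rate down to capacity, `random_shed` drops tuples independently of their content, and `semantic_shed` drops the tuples with the lowest value of a user-supplied `utility` function. A real stream manager would decide per tuple as it arrives rather than over a stored batch.

```python
import random


def drop_fraction(input_rate, capacity):
    """Fraction of tuples to shed so that the remaining rate fits the capacity."""
    if input_rate <= capacity:
        return 0.0
    return 1.0 - capacity / input_rate


def random_shed(tuples, p, rng):
    """Random load shedding: drop each tuple independently with probability p."""
    return [t for t in tuples if rng.random() >= p]


def semantic_shed(tuples, p, utility):
    """Semantic load shedding: drop the fraction p of tuples with lowest utility."""
    n_drop = int(p * len(tuples))
    by_utility = sorted(range(len(tuples)), key=lambda i: utility(tuples[i]))
    dropped = set(by_utility[:n_drop])
    return [t for i, t in enumerate(tuples) if i not in dropped]


tuples = list(range(1, 13))
p = drop_fraction(input_rate=200, capacity=100)           # 0.5
kept_random = random_shed(tuples, p, random.Random(0))
kept_semantic = semantic_shed(tuples, p, utility=lambda t: t)  # keeps 7..12
```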


Load Shedding in Aurora

• QoS for each application is specified as a function relating output to its utility

– Delay based, drop based, value based

• Techniques for introducing load shedding operators in a plan such that QoS is disrupted the least

– Determining when, where, and how much load to shed


Load Shedding in STREAM

• Formulate load shedding as an optimization problem for multiple sliding window aggregate queries

– Minimize inaccuracy in answers subject to output rate matching or exceeding arrival rate

• Consider placement of load shedding operators in query plan

– Each operator sheds load uniformly with probability p_i
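A load shedding operator of this kind, and its effect on an aggregate, can be sketched as follows. The code assumes that the operator drops each tuple independently with probability p and that a SUM over the surviving tuples is scaled by 1/(1 - p) to compensate; the scaling step and the function name `shed_and_sum` are assumptions for illustration, not a description of the STREAM implementation.

```python
import random


def shed_and_sum(window, p, rng):
    """Drop each tuple with probability p, then estimate the window's SUM."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    kept = [v for v in window if rng.random() >= p]
    # Each tuple survives with probability 1 - p, so rescale the partial sum.
    return sum(kept) / (1.0 - p)


window = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]   # exact sum is 60
rng = random.Random(42)
estimates = [shed_and_sum(window, 0.5, rng) for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)  # close to 60 on average
```

Individual estimates vary from run to run; a higher shedding probability saves more work but increases that variance, which is the accuracy trade-off the optimization problem above has to balance.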


References

[BDM02] B. Babcock, M. Datar, R. Motwani. “Sampling from a moving window over streaming data”. Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 633–634, 2002.

[BOZ08] V. Braverman, R. Ostrovsky, C. Zaniolo. “Succinct Sampling on Streams”. Submitted for publication.

[Vit85] J. S. Vitter. “Random Sampling with a Reservoir”. ACM TOMS, 1985.

[GM98] P. B. Gibbons and Y. Matias. “New Sampling-Based Summary Statistics for Improving Approximate Query Answers”. ACM SIGMOD 1998.

[BDM04] B. Babcock, M. Datar, R. Motwani. “Load Shedding for Aggregation Queries over Data Streams”. ICDE 2004, pp. 350–361.

[LawZ08] Y.-N. Law and C. Zaniolo. “Improving the Accuracy of Continuous Aggregates and Mining Queries on Data Streams under Load Shedding”. International Journal of Business Intelligence and Data Mining, 2008.
