Sliding

Approximation and Load Shedding for QoS in DSMS*

CS240B Notes

By

Carlo Zaniolo

CSD--UCLA

________________________________________

* Notes based on a VLDB’02 tutorial by Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi


Synopses and Approximation

 Synopsis: bounded-memory history-approximation

 Succinct summary of old stream tuples

 Like indexes/materialized-views, but base data is unavailable

 Examples

 Sliding Windows

 Samples

 Histograms

 Wavelet representation

 Sketching techniques

 Approximate Algorithms: e.g., median, quantiles,…

 Fast and light Data Mining algorithms


Overview of Stream Synopses

 Windows: logical, physical (covered)

 Samples: Answering queries using samples

 Histograms: Equi-depth histograms, On-line quantile computation

 Wavelets: Haar-wavelet histogram construction & maintenance


Sampling: Basics

• Idea: A small random sample S of the data often represents all the data well

– For a fast approximate answer, apply a “modified” query to S

– Example: select agg from R where odd(R.e)

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 (n=12)

Sample S: 9 5 1 8

– If agg is avg, return the average of the odd elements in S; answer: 5

– If agg is count, return n times the average over all elements e in S of 1 if e is odd, 0 if e is even; answer: 12 * 3/4 = 9

Unbiased: For expressions involving count, sum, avg: the estimator is unbiased, i.e., the expected value of the answer is the actual answer
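The example above can be reproduced with a minimal Python sketch. The function name `estimate_from_sample` and its parameters are illustrative assumptions, not part of the original notes; the stream and sample are the ones from the slide.

```python
def estimate_from_sample(sample, n, predicate):
    """Estimate avg and count of the stream elements satisfying `predicate`,
    using only a random sample of a stream of n elements."""
    matching = [e for e in sample if predicate(e)]
    # avg: average of the matching elements found in the sample
    avg_est = sum(matching) / len(matching) if matching else None
    # count: fraction of the sample that matches, scaled up to the stream size n
    count_est = n * len(matching) / len(sample)
    return avg_est, count_est


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
sample = [9, 5, 1, 8]  # the sample S from the slide
avg_est, count_est = estimate_from_sample(sample, len(stream), lambda e: e % 2 == 1)
print(avg_est, count_est)  # 5.0 9.0
```

For comparison, the exact count of odd elements in this stream is 8 and their exact average is 5.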


Probabilistic Guarantees

 Example: Actual answer is within 5 ± 1 with prob ≥ 0.9

 Use Tail Inequalities to give probabilistic bounds on returned answer

 Markov Inequality

 Chebyshev’s Inequality

 Hoeffding’s Inequality

 Chernoff Bound
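As one illustration of how a tail inequality yields such a guarantee, the sketch below applies Hoeffding's inequality, P(|sample mean − true mean| ≥ ε) ≤ 2·exp(−2mε²/r²), to a sample of m elements. It assumes the sampled values are independent and lie in a known range of width r; the function names are hypothetical.

```python
import math


def hoeffding_failure_bound(m, epsilon, value_range):
    """Upper bound on P(|sample mean - true mean| >= epsilon) for m
    independent samples whose values span at most `value_range`."""
    return min(1.0, 2 * math.exp(-2 * m * epsilon ** 2 / value_range ** 2))


def hoeffding_sample_size(epsilon, delta, value_range):
    """Smallest m for which the bound above is at most delta."""
    return math.ceil(value_range ** 2 * math.log(2 / delta) / (2 * epsilon ** 2))


# Values in 1..9 (range 8), error at most 1, confidence at least 0.9
m = hoeffding_sample_size(epsilon=1, delta=0.1, value_range=8)
print(m, hoeffding_failure_bound(m, 1, 8))
```

The bound is distribution-free, so it is often loose; tighter guarantees are possible when the variance is known.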

5

Sampling—some background

 Reservoir Sampling [Vit85]: Maintains a sample S having a preassigned size M on a stream of arbitrary size

 Add each new element to S with probability M/n, where n is the current number of stream elements

 If an element is added, evict a random element from S

 Instead of flipping a coin for each element, determine the number of elements to skip before the next one is added to S

 Concise sampling [GM98]: Duplicates in sample S are stored as <value, count> pairs (thus potentially boosting the actual sample size)

 Counting Samples [GM98]: for answering hot list queries (k most frequent values)

 Window Sampling [BDM02,BOZ08]. Maintains a sample S having a pre-assigned size M on a window on a stream—reservoir sampling with expiring tuples.
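The basic reservoir scheme described above (coin flip per element, without the skip optimization) can be sketched as follows. The function name and the optional `rng` parameter are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Maintain a uniform random sample of size m over a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= m:
            # The first m elements fill the reservoir
            sample.append(item)
        elif rng.random() < m / n:
            # Add the n-th element with probability m/n, evicting a random element
            sample[rng.randrange(m)] = item
    return sample


print(reservoir_sample(range(1000), 10, random.Random(0)))
```

Each element is inspected once and only m elements are stored, so the memory footprint does not depend on the stream length.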

6

Load Shedding Using Samples

 Given a complex query graph, how should the sampling process be used and managed? [BDM04]

 More about this later [LawZ08]

7

Overview


 Sketches

8

Histograms

 Histograms approximate the frequency distribution of element values in a stream

 A histogram (typically) consists of

 A partitioning of element domain values into buckets

 A count per bucket B (of the number of elements in B)

 Widely used in DBMS query optimization

Many types proposed:

 Equi-Depth Histograms: select buckets such that counts per bucket are equal

 V-Optimal Histograms: select buckets to minimize frequency variance within buckets

 Wavelet-based Histograms

9

Types of Histograms

• Equi-Depth Histograms

– Idea: Select buckets such that counts per bucket are equal

[Figure: equi-depth histogram; x-axis: domain values 1 to 20, y-axis: count for bucket]

• V-Optimal Histograms [IP95] [JKM98]

– Idea: Select buckets to minimize frequency variance within buckets

minimize \sum_{B} \sum_{v \in B} (f_v - C_B / V_B)^2

[Figure: V-optimal histogram; x-axis: domain values 1 to 20, y-axis: count for bucket]
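The V-optimal objective can be evaluated for a given bucketing with a short sketch. It reads f_v as the frequency of value v, C_B as the total count in bucket B, and V_B as the number of domain values in B; this reading, the function name, and the sample frequencies are assumptions for illustration.

```python
def bucket_variance_error(freqs, buckets):
    """Sum over buckets B of sum over v in B of (f_v - C_B / V_B)^2.

    freqs   -- freqs[v] is the frequency of domain value v
    buckets -- list of (start, end) index ranges, end exclusive
    """
    total = 0.0
    for start, end in buckets:
        bucket = freqs[start:end]
        mean = sum(bucket) / len(bucket)  # C_B / V_B
        total += sum((f - mean) ** 2 for f in bucket)
    return total


freqs = [2, 2, 0, 2, 3, 5, 4, 4]  # hypothetical frequency distribution
print(bucket_variance_error(freqs, [(0, 4), (4, 8)]))  # 5.0
print(bucket_variance_error(freqs, [(0, 2), (2, 8)]))  # 16.0
```

A V-optimal histogram with two buckets would prefer the first bucketing, since its error is lower.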



Equi-Depth Histogram Construction

 For histogram with b buckets, compute elements with rank n/b, 2n/b, ..., (b-1)n/b

 Example: (n=12, b=4)

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1

After sort: 1 1 2 3 4 5 5 6 7 8 9 9

rank = 3 (.25-quantile), rank = 6 (.5-quantile), rank = 9 (.75-quantile)
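The construction above amounts to reading off the elements at ranks n/b, 2n/b, ..., (b-1)n/b of the sorted data. The sketch below does this offline by sorting the whole input, which is an illustrative simplification rather than a one-pass stream algorithm; the function name is hypothetical.

```python
def equi_depth_boundaries(values, b):
    """Return the elements of rank n/b, 2n/b, ..., (b-1)n/b (ranks start at 1)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < b:
        raise ValueError("need at least as many elements as buckets")
    # Rank r corresponds to index r - 1 in the sorted list
    return [ordered[(i * n) // b - 1] for i in range(1, b)]


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
print(equi_depth_boundaries(stream, 4))  # [2, 5, 7]
```

For the slide's data, the elements at ranks 3, 6, and 9 are 2, 5, and 7.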

Answering Queries Using Histograms [IP99]

• (Implicitly) map the histogram back to an approximate relation, & apply the query to the approximate relation

• Example: select count(*) from R where 4 <= R.e <= 15

Count spread evenly among bucket values

[Figure: equi-depth histogram over domain values 1 to 20, with the query range 4 ≤ R.e ≤ 15 marked]

answer: 3.5 * C_B

• For equi-depth histograms, maximum error: ± 2 * C_B
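A range count can be estimated from a histogram by spreading each bucket's count evenly over its values, as in this minimal sketch. The bucket boundaries and counts below are hypothetical (the slide's figure does not survive in these notes), so the result differs from the 3.5 * C_B of the slide.

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(*) where lo <= e <= hi over an integer domain.

    buckets -- list of (bucket_lo, bucket_hi, count), bounds inclusive
    """
    total = 0.0
    for b_lo, b_hi, count in buckets:
        # Number of integer domain values shared by the bucket and the query range
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total


# Hypothetical equi-depth histogram: every bucket holds 3 elements
buckets = [(1, 5, 3), (6, 10, 3), (11, 20, 3)]
print(estimate_range_count(buckets, 4, 15))  # about 5.7 = 1.2 + 3 + 1.5
```

Only the two buckets that partially overlap the query range contribute error, which is the intuition behind the ± 2 * C_B bound.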



Approximate Algorithms

 Quantiles Using Samples

 Quantiles from Synopses

 One pass algorithms for approximate samples …

 Much work in this area … omitted


Overview


 Sketches


One-Dimensional Haar Wavelets

• Wavelets : Mathematical tool for hierarchical decomposition of functions/signals

• Haar wavelets : Simplest wavelet basis, easy to understand and implement

– Recursive pairwise averaging and differencing at different resolutions

Resolution   Averages                    Detail Coefficients

3            [2, 2, 0, 2, 3, 5, 4, 4]    ----

2            [2, 1, 4, 4]                [0, -1, -1, 0]

1            [1.5, 4]                    [0.5, 0]

0            [2.75]                      [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
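The recursive pairwise averaging and differencing can be written as a short sketch. It assumes the input length is a power of two, and the function name is illustrative.

```python
def haar_decompose(values):
    """Haar wavelet decomposition by repeated pairwise averaging and differencing.

    Returns [overall average, detail coefficients from coarsest to finest].
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(values)
    coeffs = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        # Coarser levels go in front of the finer ones already collected
        coeffs = details + coeffs
    return averages + coeffs


print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```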



Haar Wavelet Coefficients

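• Hierarchical decomposition structure (a.k.a. “error tree”)

[Figure: error tree with root 2.75, below it -1.25, then 0.5 and 0, then 0, -1, -1, 0 at the lowest level; edges labeled + and - lead down to the original frequency distribution 2 2 0 2 3 5 4 4. A second panel, Coefficient “Supports”, shows for each coefficient the range of data values it contributes to, with sign + or -.]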
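Reading the error tree top-down, each data value is the root average plus or minus the detail coefficients on its path (plus for a left branch, minus for a right branch). A minimal sketch of this reconstruction, with a hypothetical function name:

```python
def haar_reconstruct(coeffs):
    """Invert a Haar decomposition laid out as
    [overall average, detail coefficients from coarsest to finest]."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        # Each average splits into a left child (avg + d) and a right child (avg - d)
        values = [x for avg, d in zip(values, details) for x in (avg + d, avg - d)]
        pos += len(details)
    return values


print(haar_reconstruct([2.75, -1.25, 0.5, 0, 0, -1, -1, 0]))
# [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
```

For example, the third data value is 2.75 + (-1.25) - 0.5 + (-1) = 0.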



Compressed Wavelet Representations

Key idea: Use a compact subset of Haar/linear wavelet coefficients for approximating frequency distribution

Steps

 Compute cumulative frequency distribution C

 Compute linear wavelet transform of C

 Greedy heuristic methods

 Retain coefficients leading to large error reduction

 Throw away coefficients that give small increase in error
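A very simple stand-in for the coefficient selection step is to keep the k coefficients of largest absolute value and zero out the rest. This is an assumed simplification, not the greedy error-driven heuristics the notes refer to, and it ignores the level-dependent normalization that is often applied before comparing coefficients.

```python
def keep_largest_coefficients(coeffs, k):
    """Keep the k coefficients with largest magnitude; replace the others by 0."""
    by_magnitude = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(by_magnitude[:k])
    return [c if i in kept else 0 for i, c in enumerate(coeffs)]


coeffs = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
print(keep_largest_coefficients(coeffs, 4))
# [2.75, -1.25, 0, 0, 0, -1, -1, 0]
```

Only the retained coefficients and their positions need to be stored, which is what makes the representation compact.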


Overview


Sketches


Sketches

 Conventional data summaries fall short:

 Quantiles and 1-d histograms: Cannot capture attribute correlations

 Samples (e.g., using Reservoir Sampling) perform poorly for joins

 Multi-d histograms/wavelets: Construction requires multiple passes over the data

 Different approach: Randomized sketch synopses

 Only logarithmic space

 Probabilistic guarantees on the quality of the approximate answer

 Can handle extreme cases.


Overview


 Sketches


QoS by load shedding.

QoS and Load Shedding

 When the input stream rate exceeds system capacity, a stream manager can shed load (tuples)

 Load shedding affects queries and their answers: drop the tasks and the tuples that will cause least loss

 Introducing load shedding in a data stream manager is a challenging problem

 Random load shedding or semantic load shedding


Load Shedding in Aurora

 QoS for each application as a function relating output to its utility

– Delay based, drop based, value based

 Techniques for introducing load shedding operators in a plan such that QoS is disrupted the least

– Determining when, where and how much load to shed


Load Shedding in STREAM

 Formulate load shedding as an optimization problem for multiple sliding window aggregate queries

– Minimize inaccuracy in answers subject to output rate matching or exceeding arrival rate

 Consider placement of load shedding operators in query plan

– Each operator sheds load uniformly with probability p_i
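A random load shedding operator of this kind can be sketched as a filter that treats every tuple the same way, followed by a rescaling of the aggregate. Here the probability is interpreted as the chance of keeping a tuple (`keep_prob`); that interpretation and both function names are assumptions made for illustration.

```python
import random


def shed_load(tuples, keep_prob, rng=None):
    """Pass each tuple independently with probability keep_prob; drop the rest."""
    if rng is None:
        rng = random.Random()
    return [t for t in tuples if rng.random() < keep_prob]


def estimate_sum(kept, keep_prob):
    """Scale the sum over surviving tuples to estimate the sum over all tuples."""
    return sum(kept) / keep_prob


window = list(range(1, 101))  # hypothetical window contents, true sum 5050
kept = shed_load(window, 0.5, random.Random(42))
print(len(kept), estimate_sum(kept, 0.5))
```

Dividing by the keep probability compensates for the dropped tuples in expectation, at the cost of added variance in the answer.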


References

[BDM02] B. Babcock, M. Datar, R. Motwani. “Sampling from a moving window over streaming data”. Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, p. 633–634, 2002.

[BOZ08] V. Braverman, R. Ostrovsky, C. Zaniolo. “Succinct Sampling on Streams”. Submitted for publication.

[Vit85] J. S. Vitter. “Random Sampling with a Reservoir”. ACM TOMS, 1985.

[GM98] P. B. Gibbons and Y. Matias. “New Sampling-Based Summary Statistics for Improving Approximate Query Answers”. ACM SIGMOD 1998.

[BDM04] B. Babcock, M. Datar, R. Motwani. “Load Shedding for Aggregation Queries over Data Streams”. ICDE 2004: 350-361.

[LawZ08] Y.-N. Law and C. Zaniolo. “Improving the Accuracy of Continuous Aggregates and Mining Queries on Data Streams under Load Shedding”. International Journal of Business Intelligence and Data Mining, 2008.

