Document

advertisement
Memory Requirements
of Data Streams
Reynold Cheng
19th July, 2002
Data Stream Model
User/Application
Register Query
Results
Stream Query
Processor
Data Stream Model
User/Application
Register Query
Results
Stream Query
Processor
Scratch Space
(Memory and/or Disk)
Data
Stream
Management
System
(DSMS)
Impact of Limited Memory


Continuous streams grow unboundedly
Question: Can a continuous query be
evaluated using a finite amount of memory?
Query Model






SPJ (Select-Project-Join) Queries
L(P(S1 x S2 x … x Sn))
L: List of projected attributes
P: Selection Predicate
S1, S2,…, Sn: Input Streams
: duplicate-eliminating; ’: duplicatepreserving
Query Model: Predicate P


P: conjunction of atomic predicates
Atomic Predicate



Si.A Op Sj.B (i = j or i != j)
Si.A Op k
Op: {<, =, >}
Query Model: Attributes


All attributes have discrete, ordered domains
All attributes are of type integer
Query Model: Monotonic and
Exact Answers



The queries are monotonic
Any tuple that appears in the answer at any
point in time continues to do so forever.
The answers produced are exact i.e., no
approximation.
Motivating Examples


When does a SPJ query require only a
bounded amount of memory?
Assume 2 streams: S(A, B, C) and T(D, E)
Query 1
’A(A>10(S))



Is a simple filter on S
Tuple-at-a-time processing
No extra memory for saving stream tuples
Query 1 (again)
A(A>10(S))



Keep track of each distinct value of A > 10
To eliminate duplicates
Requires unbounded memory
Query 2
A(A=D(S x T))



Save each tuple s in S, since s may join with
any tuples in T that arrive in the future
Also need to save each tuple in T
Requires unbounded memory
Query 3
’A(A=D ^ A > 10 ^ D < 20 (S x T))


Can be evaluated with bounded memory!
For each integer v in [11,19], keep:




current # of tuples in S with A = v (v.S)
current # of tuples in T with D = v (v.T)
For an incoming tuple from stream S with A =
v, output v for v.T times.
For an incoming tuple from stream T with D =
v, output v for v.S times.
Query 4
A(B < D ^ A > 10 ^ A < 20 (S x T))


Can be evaluated with bounded memory!
For each integer v in [11,19], keep:




Current min value of B among all tuples in S with A = v
(v.B_min)
Current max value of D among all tuples in T with A = v
(v.D_max)
For an incoming tuple from stream S, check:
S.B < v.D_max.
For an incoming tuple from stream T, check:
T.D > v.B_min.
Our Goals




Some queries involving join can be answered
using finite amount of memory
Under what conditions can a query be
answered using finite memory over all
possible instances of streams?
Identify a class of queries that can be
evaluated with a bounded amount of memory
Consider both duplicate-preserving and
duplicate-eliminating projections
Bounded Memory
Computability

An SPJ query is bounded-memory
computable if we can find:



A constant M and,
An algorithm that evaluates the query using fewer
than M units of memory
A unit of memory can store one attribute
value or a count
Memory boundness testing
(Outline)


Rewrite Q as a union of Locally Totally
Ordered Queries (LTO queries), based on
the following theorem:
Any query Q can be rewritten as Q1 U Q2
U … U Qm, where each Qi is an LTO query
and unions are duplicate-preserving
Memory boundness testing
(Outline)


Q is computable in bounded memory if and
only if all its decomposed LTO queries are
computable in bounded memory.
Key: Develop a theorem for checking the
memory boundness of a LTO query
Definitions






Can write a query Q as Q(P), when only P is
important to discussions
Element: constant / attribute of a stream
C(Q): Set of constants appearing in Q
S(Q): Set of streams appearing in Q
A(S): Set of attributes in stream S
(S) = A(S) U C(Q): set of elements in Q
potentially relevant to stream S
Total Ordering

P+: transitive closure of P


A = 10 and B < A => B < 10
Set of elements E is totally ordered by a set
of predicates P if


for any elements e1 and e2 in E,
Exactly one of e1<e2, e1=e2, or e1>e2 is in P+
Order-Inducing Predicates



A set of predicates P is order-inducing if a set
of elements E is totally ordered by P.
TO(E): Set of all order-inducing sets of
predicates for elements E
Example



E = {A,B,5}
Order-inducing sets: {A < B, 5 < A}, {A = B, B < 5}.
These two sets belong to TO(E).
LTO Query

A query Q(P) is LTO query if for every S 
S(Q), (S) is totally ordered by P


Where S(Q): Set of streams appearing in Q
(S) = A(S) U C(Q): set of elements in Q
potentially relevant to stream S
Decomposition of a Query into
LTO Queries

Q = ’L(P(S1 x S2 x … x Sn)) can be
decomposed into Q1 U Q2 U … U Qm,
where Qi is an LTO query.
 Let TO((Si)) = {Ti1, Ti2,…, Timi},

Tij is a local total ordering – one possible
ordering of Si and query constants.
Decomposition Theorem

An exhaustive union of m1 x m2 x mn queries:
Q(P U T11 U T21 U … U Tn1) U
Q(P U T12 U T21 U … U Tn1) U
…U
Q(P U T11 U T22 U … U Tn1) U
Q(P U T12 U T22 U … U Tn1) U
…U
Q(P U T1m1 U T2m2 U … U Tnmn)
Boundedness of Attributes




An attribute A is lower-bounded by a set of
predicates P if there is an atomic predicate
A > k  P+ for some constant k
An attribute A is upper-bounded by a set of
predicates P if there is an atomic predicate
A < k  P+ for some constant k
An atomic attribute is bounded if it is both upperbounded and lower-bounded
An attribute is unbounded if it is not bounded
MaxRef and MinRef


MaxRef(Si) is the set of all unbounded
attributes A of Si that participate in join of the
form Sj.B < Si.A, i  j
MinRef(Si) is the set of all unbounded
attributes A of Si that participate in join of the
form Si.A < Sj.B, i  j
Boundness Testing of LTO
Query (Duplicate-Preserving)

1.
2.
3.
Let Q = ’L(P(S1 x S2 x … x Sn)) be an LTO
query where n > 1. Q is bounded memory
computable iff:
Every attribute in L is bounded
For every equality join predicate Si.A=Sj.B
where i  j, Si.A and Sj.B are both bounded
|MaxRef(Si)| = |MinRef(Si)| = 0 for i = 1,…,n
Proof Outline




We create synopses of n data streams
For each stream Si, partition the tuples that
satisfy the total order condition on Si into
distinct buckets based on the values of the
bounded attributes
Tuples with the same values of all bounded
attributes are placed in the same bucket
Tuples that differ on at least one bounded
attribute are placed in different buckets
Proof Outline (2)

For each bucket, store




The values for the bounded attributes
Total number of tuples falling into the bucket
The total size of these synopses is bounded
by constant
Can be proved that if the 3 conditions are
satisfied, these synopses suffice to evaluate
the query.
Proof (If):




Attributes not in project list / join condition can be
ignored, because all local selection conditions must
be implied by the total order
C1 and C2 guarantee that all attributes in equijoin
and project list are bounded
Each synopsis maintains full information about the
values of all bounded attributes for every tuple
Equijoin and projections can be handled properly
Proof (If):

1.
2.

C3 asserts For every Si.A < Sj.B, i  j, either one is
true:
Both attributes are bounded --- Synopses have full
information for A & B
The total order implies Si.A < c < Sj.B where c is
some constant in Q --- No need to store actual
attribute values, since all tuples from Si that satisfy
the total order on Si join with all tuples from Sj that
satisfy the total order on Sj
No relevant information is lost by consolidating
tuples into buckets
Proof (Only if):


If one of the conditions C1, C2 and C3 does
not hold, then Q cannot always be evaluated
in bounded memory.
For each condition, if the condition is violated,
one can construct instances and
presentations of the input streams that
require more than M memory to correctly
evaluate Q.
Boundness Testing of LTO
Query (Duplicate-Eliminating)

Let Q = L(P(S1 x S2 x … x Sn)) be an LTO
query where n > 1. Q is bounded memory
computable iff:




Every attribute in L is bounded
For every equality join predicate Si.A=Sj.B where
i  j, Si.A and Sj.B are both bounded
|MaxRef(Si)|eq + |MinRef(Si)|eq  1 for i = 1,…,n
|E|eq denotes the number of P-induced
equivalence classes in the element set E.
Comments




Q is computable in bounded memory if and only if all
its decomposed LTO queries are computable in
bounded memory.
The checking algorithm needs O(|A(Q)|4) times.
If a query is not memory-bounded computable,
approximation algorithms are necessary.
For a memory-bounded computable query,
computation is costly in the evaluation of each LTO
query. Need to evaluate the query for every arrival
of tuples.
Comments




Only spatial requirement, not query speed, is
considered.
Can we extend the memory-boundness checking to
general (including aggregate) queries?
It is not always possible to provide exact answers to
queries, so approximation query algorithms have to
be developed.
For these approximation queries, need to consider
memory-boundness, execution speed and
approximation quality.
Outline of this Talk



Memory Requirements of Queries
Approximation Queries
Other Research Issues
Approximate Query Evaluation






Why?
Handling load – streams coming too fast
Data stream is archived in a off-site data
warehouse, expensive access of archived
data
Avoid unbounded storage and computation
Ad hoc queries need approximate history
Try to look at the data items only once and
in a fixed order
Approximate Query Evaluation


How? Sliding windows, synopsis, samples
Major Issues?






Metric for set-valued queries
Composition of approximate operators
How is it understood/controlled by user?
Integrate into query language
Query planning and interaction with resource
allocation
Accuracy-efficiency-storage tradeoff and global
metric
Synopses



Queries may access or aggregate past data
Need bounded-memory history-approximation
Synopsis?



Succinct summary of old stream tuples
Like indexes/materialized-views, but base data is
unavailable
Examples





Sliding Windows
Samples
Sketches
Histograms
Wavelet representation
Sketching Techniques






Self-Join Size Estimation
Stream of values from D = {1,2,…,n}
Let fi = frequency of value i
Consider S = Σ fi2, or Gini’s index of
homogeneity.
Useful in parallel DB applications, error
estimation in query result size estimation and
access plan costs.
Equivalent query: count (R |><|D R)
Evaluating S = Σ fi2



To update S, keep a counter fi for each value
i in the domain D  (n) space
Has to be kept for each self-join
Question – estimating S in sub-linear space?
(O(log n))
Self-Join Size Estimation

AMS Technique (randomized sketches)
 Given (f1,f2,…,fN)
 Zi = random{-1,1}
 X = Σ fiZi (X incrementally computable)

Theorem Exp[X2] = Σ fi2




Cross-terms fiZi fjZj have 0 expectation
Square-terms fiZi fiZi = fi2
Space = log (N + Σ fi)
Independent samples Xk reduce variance
Estimation Quality



How can independent samples Xk improve
the quality of estimation?
Keep s1 x s2 samples for Xk
s1 reduces variance, s2 boosts confidence
s1
Atomic
Sketch
Avg(X1j2)
Avg(X2j2)
s2
Avg(X3j2)
Avg(X4j2)
Avg(X5j2)
Median
(sketch)
Sample Run of AMS
Z1 =
V =
3
1
-1 1
1
Σvi2 = 123
Z1 =
6
2
-1
Z2 =
X1= 5, X12 = 25
V =
4
1
-1 1
1
5
6
-1
7
-1 1
-1 1
X2= 14, X22 = 196
2
5
Z2 =
1
Est = 110.5
7
-1 1
-1 1
1
Σ vi2 = 130, X1= 6, X12 = 36, X2= 12, X22 = 144, Est = 90
Comments on AMS



The self-join size can be computed on-line
Sufficiently small variance (controlled by s1 and s2)
Can this method be extended to answer other
queries?
Complex Aggregate Queries





A. Dobra et al. extend the idea of AMS to provide
approximate answers to complex aggregate queries.
SELECT AGG FROM R1,R2,…,Rr where E
AGG: COUNT/SUM/AVERAGE
E: conjunction of (Ri.Aj = Rk.Al)
It is proved that the error of these estimates is at
most ε with probability 1-δ.
Basic Notions of
Approximation

For aggregate queries (e.g., SUM, COUNT), approximation
quality can be measured by relative error:



(Estimated value – Actual value) / Actual value
Open question: for queries involving more than simple
aggregation, how should we define approximation?
Consider S |><|BT: (S: {A,B}, T: {B,C})
A
B
C
A
B
C
10
20.5
Doctor
8
10.3
Lawyer
8
10.3
Lawyer
3
10.2
Teacher
3
10.2
Teacher
Actual Result
Approximate Result
Basic Notions of
Approximation (2)

Can we accept this kind of approximation?
A
B
C
A
B
C
10
20.5
Doctor
11
21.6
Doctor
8
10.3
Lawyer
8
10.3
Student
3
10.2
Teacher
3
10.2
Teacher
Actual Result
Approximate Result
Basic Notions of
Approximation (3)

Can we provide useful (semantically correct) but stale results?
A
B
C
A
B
C
10
20.5
Doctor
10
20.5
Doctor
8
10.3
Lawyer
8
10.3
Lawyer
3
10.2
Teacher
Actual Result (at time t)
Approximate Result
(correct result at time t - )
Outline of this Talk




An Overview of Streams
Data and Query Model
Approximation Queries
Other Research Issues
Data Mining

High-Speed Stream Data Mining





Association Rules
Stream Clustering
Decision Trees
Single-pass algorithms for infering
interesting patterns on-line (as the data
stream arrives)
Useful for mission-critical tasks like telecom
fraud detection
Conclusion: Future Work

Query Processing




Runtime Management





Stream Algebra and Query Languages
Approximations
Blocking Operators, Constraints, Punctuations
Scheduling, Memory Management, Rate Management
Query Optimization (Adaptive, Multi-Query, Ad-hoc)
Distributed processing
Synopses and Algorithmic Problems
Systems


UI, statistics, crash recovery and transaction management
System development and deployment
References
B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom.
Models and Issues in Data Stream Systems,
PODS ’02. (Paper and Talk Slides)
A. Arasu, B. Babcock, S. Babu, J. McAlister, J. Widom.
Characterizing Memory Requirements for Queries
over Continuous Data Streams, PODS ’02.
A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi.
Processing Complex Aggregate Queries over Data
Streams, SIGMOD ’02.
N. Alon, Y. Matias, M. Szegedy. The Space
Complexity of Approximating the Frequency
Moments, STOC ’96.
Thank You!
Download