Data Streams
• Definition: data arriving continuously, usually only as insertions of new elements. The size of the stream is not known a priori and may be unbounded.
• Hot research area

Data Streams
• Applications:
– Phone call records in AT&T
– Network monitoring
– Financial applications (stock quotes)
– Web applications (data clicks)
– Sensor networks

Continuous Queries
• Mainly used in data stream environments
• Defined once, and run until the user terminates them
• Example: give me the names of the stocks that increased their value by at least 5% over the last hour

What is the Problem (1)?
• Q is a selection: then size(A) may be unbounded. Thus, we cannot guarantee we can store it.

What is the Problem (2)?
• Q is a self-join: if we want to provide only NEW results, then we need unlimited storage to guarantee no duplicates exist in the result.

What is the Problem (3)?
• Q contains aggregation: tuples already in the answer A might have to be deleted when new tuples are observed. What if B < 0? A group that qualified earlier may drop below 100 when a negative B arrives.
Example:
  Select A, sum(B)
  From Stream X
  Group by A
  Having sum(B) > 100

What is the Problem (4)?
• What if we can delete tuples in the stream?
• What if Q contains a blocking operator near the top (example: aggregation)?
• Online aggregation techniques are useful here

Global Architecture
• (Architecture figure; its Store, Scratch, and Throw components are referenced below.)

Related Areas
• Data Approximation: limits the size of Scratch, Store
• Grouping Continuous Queries submitted over the same sources
• Adaptive Query Processing (data sources may be providing elements at varying rates)
• Partial Results: give partial results to the user (the query may run forever)
• Data Mining: can the algorithms be modified to use one scan of the data and still provide good results?

Initial Approaches (1)
• Typical approach: limit the expressiveness of the query language to bound the size of Store and Scratch
• Alert (1991):
– Triggers on append-only (Active) tables
– Event-Condition-Action triggers:
• Event: cursor on an Active Table
• Condition: From and Where clauses of the rule
• Action: Select clause of the rule (typically calling a function)
– Triggers were expressed as continuous queries
– The user was responsible for monitoring the size of the tables

Initial Approaches (2)
• Tapestry (1992):
– Introduced the notion of Continuous Queries
– Used a subset of SQL (TQL)
– A query Q was converted to the Minimum Monotone Bounding Query QM, with QM(t) = ∪_{τ ≤ t} Q(τ)
– QM was then converted to an incremental query QI
– Problems:
• Duplicate tuples were returned
• Aggregation queries were not supported
• No outer-joins allowed

Initial Approaches (3)
• Chronicle Data Model (1995):
– Data streams are referred to as Chronicles (append-only)
– Assumptions:
• A new tuple is not joined with previously seen tuples.
• At most a constant number of tuples from a relation R can join with a Chronicle C.
– Achievement: incremental maintenance of views in time independent of the Chronicle size

Materialized Views
• Work on self-maintenance: important for limiting the size of Scratch. For a view to be self-maintainable, any auxiliary storage must occupy bounded space.
• Work on data expiration: important for knowing when to move elements from Scratch to Throw.

Data Approximation
• The area where most work is being done nowadays
• Problem: we cannot afford O(N) space/time cost per element to solve a problem, and we want solutions close to O(poly(log N)).
• Techniques: sampling, histograms, wavelets, sketching

Sampling
• The easiest one to implement and use
• Reservoir Sampling: the dominant algorithm (a sketch follows below)
• Can be used for almost any problem (but with serious limitations, especially in the case of joins)
• Stratified Sampling (sampling data at different rates):
– Reduces the variance in the data
– Reduces the error in Group-By queries
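A minimal sketch of reservoir sampling (Vitter's Algorithm R), assuming we want a uniform k-element sample over a stream of unknown length; the function name and the example stream are illustrative:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k elements over a stream
    of unknown (possibly unbounded) length (Algorithm R)."""
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, n)    # uniform position in [0, n]
            if j < k:
                reservoir[j] = item     # keep item with probability k/(n+1)
    return reservoir

# Example: a 10-element sample from a stream of one million integers
print(reservoir_sample(range(1_000_000), 10))
```

At every point the reservoir is a uniform sample of everything seen so far, which is exactly what makes it usable when the stream size is not known a priori.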
Histograms
• V-Optimal:
– Gilbert et al. removed the sortedness restriction: time/space using sketches is O(poly(B, log N, 1/ε))
• Equi-Width:
– Compute quantiles in O((1/ε)·log(εN)) space with precision εN
• Correlated Aggregates:
– AGG-D{Y : Predicate(X, AGG-I(X))}
– AGG-D: Count or Sum
– AGG-I: Min, Max or Average
– Reallocate the histogram based on the arriving tuples and the AGG-I (if we want the min, are storing [min, min + ε] in the histogram, and a new min arrives, throw away the previous histogram)

Wavelets
• Used for signal decomposition (good if the measured aggregate follows a signal)
• Matias, Vitter: incremental maintenance of the top wavelet coefficients
• Gilbert et al.: point and range queries with wavelets

Sketching Techniques (1)
• Main idea: if getting the exact value of a variable V requires O(n) cost, use an approximation instead:
• Define a random variable R with expected value equal to that of V, and small variance.
• Example (self-join size):
– Select 4-wise independent ±1 random variables ξ_i (i = 1, …, dom(A))
– Define Z = X², where X = Σ_i f(i)·ξ_i and f(i) is the frequency of the i-th value; then E[Z] equals the self-join size
– The result is the median of s₂ variables Y_j, where each Y_j is the average of s₁ copies of Z (averaging boosts accuracy; the median boosts the confidence)

Sketching Techniques (2)
• Answer complex aggregate queries
• The frequency moments F_k = Σ_{i=1..d} m_i^k capture statistics of the data:
– m_i: the frequency of occurrence of value i
– F_0: the number of distinct values
– F_1: the total number of elements
– F_2: the Gini index (useful for self-joins); a sketch follows below
• The L1 and L2 norms of a vector can be computed similarly to F_2
• Quantiles (a combination of histograms and sketches)
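The self-join estimator above is concrete enough for code. Below is a toy version: for illustration it draws a truly random ±1 sign per value and stores it (a real implementation uses 4-wise independent hash functions precisely so that the signs need not be stored), and s1/s2 play the averaging and median roles described above:

```python
import random
import statistics

def ams_f2_estimate(stream, s1=16, s2=5, seed=0):
    """Estimate F2 = sum_i m_i^2 (the self-join size).
    Each counter holds X = sum_i f(i)*xi_i with random signs xi_i,
    so E[X^2] = F2; averages of s1 copies are combined by a median
    over s2 groups to boost accuracy and confidence."""
    rng = random.Random(seed)
    copies = s1 * s2
    signs = [{} for _ in range(copies)]  # toy only: real sketches hash instead
    X = [0] * copies
    for item in stream:
        for c in range(copies):
            if item not in signs[c]:
                signs[c][item] = rng.choice((-1, 1))
            X[c] += signs[c][item]       # one counter update per copy
    averages = [sum(X[g * s1 + j] ** 2 for j in range(s1)) / s1
                for g in range(s2)]
    return statistics.median(averages)

# The true F2 of this stream is 3^2 + 2^2 + 1^2 = 14
print(ams_f2_estimate(['a', 'a', 'a', 'b', 'b', 'c']))
```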
Grouping Continuous Queries
• Goal: group similar queries over the same data sources, to eliminate common processing and minimize the response time and storage needed.
• Niagara (joint work of Wisconsin, Oregon)
• Tukwila (Washington)
• Telegraph (Berkeley)

Niagara (1)
• Supports thousands of queries over XML sources.
• Features:
– Incremental grouping of queries
– Supports queries evaluated when the data sources change (change-based)
– Supports queries evaluated at specific intervals (timer-based)
• Timer-based queries are harder to group because of overlapping time intervals
• Change-based queries have better response times but waste more resources.

Niagara (2)
• Why group queries:
– Share computation
– Test multiple "fire" actions together
– Group plans can be kept in memory more easily

Niagara - Key Ideas
• Query expression signature
• Query plan (generated by the Niagara parser)

Group
• Group signature (the common signature of all queries in a plan)
• Group constant table (the signature constants of all queries in the group, and their destination buffers)
• Group plan: the query plan shared by all queries in the group.

Incremental Grouping
• Create the signature of the new query. Place the most selective predicates in the lower parts of the signature.
• Insert the new query, bottom-up, into the group that best matches its signature
• If no match is found, create a new group for this query
• Store any timer information, and the data sources needed for this query

Other Issues (1)
• Why write output to a file instead of pipelining?
– Pipelining would fire all actions, even ones that did not need to be fired
– Pipelining does not work for timer-based queries, where results need to be buffered.
– The split operator may become a bottleneck if its outputs are consumed at widely different rates
– The query plan becomes too complex for the optimizer

Other Issues (2)
• Selection operator above or below joins?
– Below only if the selections are very selective
– Otherwise, it is better to have a single join
• Range queries?
– Handled like equality queries: save the lower and upper bounds
– Output to one common sorted file to eliminate duplicates.

Tukwila
• Adaptive query processing over autonomous data sources
• Periodically changes the query plan if the output of the operators is not satisfactory.
• Performs cleanup at the end; some calculated results may have to be thrown away.

Telegraph
• Adaptive query engine based on the Eddy concept
• Queries run over autonomous sources on the Internet
• The environment is unpredictable and data rates may differ significantly during query execution; therefore query processing SHOULD be adaptive.
• Can also help produce partial results

Eddy
• Eddy: routes tuples to operators for processing, gets them back, and routes them again…

Eddy – Knowing the State of Tuples
• Passes tuples by reference to the operators (avoids copying)
• When the Eddy has no more input tuples, it polls the sources for more input.
• Tuples are augmented with additional information:
– Ready bits: which operators still need to be applied
– Done bits: which operators have been applied
– Queries Completed: signals whether the tuple has been output or rejected by the query
– Completion mask (per query): to know when a tuple can be output for a query ((completion mask & done bits) == mask)

Eddy – Other Details
• Queries with no joins are partitioned per data source (to save space in the bits required)
• Queries with disjunctions (ORs) are transformed into conjunctive normal form (an AND of ORs)
• Range/exact predicates are kept in a grouped filter

Joins - SteMs
• SteMs (State Modules): multiway pipelined joins
• Double-pipelined joins maintain a hash index on each relation.
• When N relations are joined, at least N − 2 in-flight indices on intermediate results are needed, even for left-deep trees.
• The previous approach cannot change the query plan without recomputing the intermediate indices.

SteMs - Functionality
• A SteM keeps a hash table (or other index) on one data source
• Can have tuples inserted into it (passed in from the Eddy)
• Can be probed; intermediate tuples (join results) are returned to the Eddy with the appropriate bits set
• Tuples carry sequence numbers: a tuple X can join only with tuples in a SteM M whose sequence numbers are lower than X's (they arrived earlier).

Telegraph – Routing
• How to route tuples between operators? (a sketch follows below)
– Route to the operator with the smaller queue
– Route to the more selective operators first (ticket scheme)
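A minimal sketch of the ticket scheme, under assumed semantics: the Eddy grants an operator a ticket whenever it routes a tuple to it and takes the ticket back when the operator returns the tuple (i.e., the tuple survives the predicate), so operators that reject many tuples accumulate tickets and win the routing lottery more often. All class and field names are hypothetical:

```python
import random

class Operator:
    """A selection operator with a ticket count used for routing."""
    def __init__(self, name, predicate):
        self.name, self.predicate = name, predicate
        self.tickets = 1                    # every operator starts with one

def route_until_done(eddy_ops, value):
    """Route one tuple through all operators, lottery-style by tickets.
    Returns the value if it satisfies every predicate, else None."""
    done = set()
    while len(done) < len(eddy_ops):
        pending = [op for op in eddy_ops if op.name not in done]
        op = random.choices(pending, weights=[p.tickets for p in pending])[0]
        op.tickets += 1                     # ticket granted on consumption
        done.add(op.name)
        if not op.predicate(value):
            return None                     # rejected: operator keeps the ticket
        op.tickets -= 1                     # tuple returned: ticket taken back
    return value

ops = [Operator('gt10', lambda v: v > 10),
       Operator('even', lambda v: v % 2 == 0)]
results = [v for v in [3, 12, 15, 20] if route_until_done(ops, v) is not None]
print(results)   # [12, 20]
```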
Partial Results - Telegraph
• Idea: when a tuple returns to the Eddy, it may already contribute to the final result (fields may be missing because of joins not yet performed).
• Present the tuple anyway; the missing fields will be filled in later.
• The tuple is guaranteed to be in the result if referential constraints (foreign keys) exist. This is not usual in web sources.
• It might still be useful to the user to present tuples that end up with no matching fields (as in an outer join).

Partial Results - Telegraph
• Results are presented in a tabular representation
• The user can:
– Re-arrange columns
– Drill down (add columns) or roll up (remove columns)
• Assume the current area of focus is where the user needs more tuples.
• Weight tuples based on:
– The selected columns and their order
– The selected values for some dimension
• The Eddy sorts tuples according to their benefit to the result and schedules them accordingly

Partial Results – Other Methods
• Online Aggregation: present the current aggregate with error bounds, and continuously refine the result
• Previous approaches involved changing some blocking operators so that partial results can be produced:
– Join (use a symmetric hash-join)
– Nest
– Average
– Except

Data Mining (1)
• General problem: data mining techniques usually require:
– The entire dataset to be present (in memory or on disk)
– Multiple passes over the data
– Too much time per data element

Data Mining (2)
• New algorithms should:
– Require a small, constant time per record
– Use a fixed amount of memory
– Use one scan of the data
– Provide a useful model at all times
– Produce a model close to the one that multiple passes over the same data would produce if the dataset were available offline
– Adapt the model when the generating phenomenon changes over time

Decision Trees
• Input: a set of examples (x, v), where x is a vector of D attributes and v is a discrete class label
• At each node, find the best attribute to split on.
• Hoeffding bounds are useful here:
– Consider a variable r with range R
– Take n independent observations
– The computed average r' differs from the true average of r by at most ε with probability 1 − δ, where ε = √(R² ln(1/δ) / (2n))

Hoeffding Tree
• At each node, maintain counts for each attribute X, each value X_i of X, and each class
• Let G(X_i) be the heuristic measure used to choose test attributes (for example, the Gini index)
• Let A and B be the two attributes with the highest G
• If G(A) − G(B) > ε, then with probability 1 − δ, A is the correct attribute to split on (a numeric sketch follows below)
• Memory needed = O(dvc) (dimensions, values, classes)
• One can prove that the produced tree is very close to the optimal tree.

VFDT Tree
• An extension of the Hoeffding tree
• Breaks ties more aggressively (if they delay splitting)
• Recomputes G only after n_min new tuples arrive (splits are not that frequent anyway)
• Removes the least promising leaf nodes if a memory problem exists (they may be reactivated later)
• Drops attributes from consideration if their G value is very small at the beginning

CVFDT System
• The source producing the examples may change its behavior significantly.
• At some nodes of the tree, the current splitting attribute may no longer be the best
• Grows alternate subtrees while keeping the previous one, since at the beginning the alternate tree is small and would probably give worse results
• Periodically uses a batch of samples to evaluate the quality of the trees.
• When an alternate tree becomes better than the old one, the old one is removed.
• CVFDT also has smaller memory requirements than VFDT over sliding-window samples.
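A minimal numeric sketch of the Hoeffding split test above; the helper names are illustrative, R depends on the chosen heuristic (R = 1 suits measures bounded in [0, 1]), and the G values in the example are made up:

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon such that, after n observations of a variable with range R,
    the sample mean is within epsilon of the true mean w.p. 1 - delta."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, n, R=1.0, delta=1e-7):
    """Split when the gap between the two best attributes' heuristic
    values exceeds the bound: the leader is then, with probability
    1 - delta, the truly best attribute."""
    return (g_best - g_second) > hoeffding_bound(R, delta, n)

# Example: after 5000 examples, G(A) = 0.30 and G(B) = 0.25
print(round(hoeffding_bound(1.0, 1e-7, 5000), 4))  # ~0.0402
print(should_split(0.30, 0.25, 5000))              # True: 0.05 > 0.0402
```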
OLAP
• On-Line Analytical Processing
• Requires processing very large quantities of data to produce a result
• Updates are usually done in batch, and sometimes while the system is offline
• The organization of the data is extremely important (query response times, and mainly update times, can vary by several orders of magnitude)

Terminology
• Dimension
• Measure
• Aggregate function
• Hierarchy
• What the CUBE operator computes:
– all 2^D possible views, if no hierarchies exist
– ∏_{i=1..D} levels(Dim_i) views, if hierarchies exist

Cube Representations - MOLAP
• MOLAP: a multi-dimensional array
• Good for dense cubes, as it does not store the attributes of each tuple
• Bad for sparse (high-dimensional) cubes
• Needs no indexing if stored as-is
• Typical methods store the dense set of dimensions in MOLAP mode and index the remaining dimensions with other methods
• How to store it? Chunk it into blocks to speed up range queries

Cube Representations - ROLAP
• Stores the views in relations
• The produced relations need to be indexed, otherwise queries will be slow.
• Indexes slow down updates
• Issue: if limited space is available, which views should be stored?
– Store the fact table, and the smaller views (the ones that have performed the most aggregation)
– Queries usually specify few dimensions
– These views are the more expensive ones to compute on-the-fly

Research Issues - ROLAP
• How to compute the CUBE:
– Compute each view from its smallest parent
– Share sort orders
– Exhibit locality
• Can the size of the cube be limited?
– Prefix redundancy (Cube Forests)
– Suffix redundancy (Dwarf)
– Approximation techniques (wavelets)

Research Issues - ROLAP
• How to speed up selected classes of queries, e.g. range-sum and count: different structures exist for each case (Partial Sum, Dynamic Data Cube); a prefix-sum sketch follows below.
• How to best represent hierarchical data. Almost no research exists here.
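A minimal sketch of the prefix-sum ("Partial Sum") idea behind constant-time range-sum queries, shown in 2-D with hypothetical names: precompute P[i][j], the sum of all cells with indices below (i, j); any rectangular range-sum then needs only four lookups:

```python
def build_prefix_sums(cube):
    """P[i][j] = sum of cube[0..i-1][0..j-1]; row/col 0 act as zero guards."""
    rows, cols = len(cube), len(cube[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            P[i + 1][j + 1] = cube[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P

def range_sum(P, r1, c1, r2, c2):
    """Sum of cube[r1..r2][c1..c2] by inclusion-exclusion: O(1) per query."""
    return P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]

cube = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
P = build_prefix_sums(cube)
print(range_sum(P, 1, 1, 2, 2))   # 5 + 6 + 8 + 9 = 28
```

The trade-off is update cost: changing one cell touches many prefix cells, which is the imbalance that structures like the Dynamic Data Cube set out to address.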