Mining from Data Streams - Competition between Quality and Speed
Adapted from: Wei-Guang Teng (鄧維光) and S. Muthukrishnan's presentations

Streaming: Finding Missing Numbers
- Paul permutes the numbers 1…n and shows all but one to Carole, in the permuted order, one after the other.
- Carole must find the missing number, but she cannot remember all the numbers she has been shown.

Streaming: Finding Missing Numbers
- Carole accumulates the sum of all the numbers she has been shown; at the end she subtracts this sum from n(n+1)/2.
- Analysis:
  - Takes O(log n) bits to store the partial sum.
  - Performs one addition each time a new number is shown (O(log n) time per number).
  - Performs one subtraction at the end (O(log n) time).

Data Streams (1)
- Traditional DBMS: data stored in finite, persistent data sets.
- New applications: data arrives as continuous, ordered data streams.
  - Network monitoring and traffic engineering
  - Telecom call detail records (CDR)
  - ATM operations in banks
  - Sensor networks
  - Web logs and click-streams
  - Transactions in retail chains
  - Manufacturing processes

Data Streams (2)
- Definition: continuous, unbounded, rapid, time-varying streams of data elements.
- Application characteristics:
  - Massive volumes of data (can be several terabytes).
  - Records arrive at a rapid rate.
- Goal: mine patterns, process queries and compute statistics on data streams in real time.

Data Stream Algorithms
- Streaming involves:
  - A small number of passes over the data (typically 1?).
  - Sublinear space (sublinear in the universe or in the number of stream items?).
  - Sublinear time for computing (?).
- Similar to dynamic, online, approximation or randomized algorithms, but with more constraints.

Data Streams: Analysis Model
[Diagram: a user/application submits a query or mining target to a stream processing engine, which reads the input stream, uses bounded scratch space (memory and/or disk), and returns results.]

Motivation
- 3 billion telephone calls in the US each day.
- 30 billion emails daily; 1 billion SMS and IMs.
- Scientific data: NASA's observation satellites generate billions of readings each day.
- IP network traffic: up to 1 billion packets per hour per router, and each ISP has many hundreds of routers!
- Compare to human-scale data: "only" 1 billion worldwide credit card transactions per month.

Network Management Application
- Monitoring and configuring network hardware and software to ensure smooth operation.
[Diagram: measurements and alarms flow between the network and the Network Operations Center.]

IP Network Measurement Data
IP session data:

  Source     Destination   Duration   Bytes   Protocol
  10.1.0.2   16.2.3.7      12         20K     http
  18.6.7.1   12.4.0.3      16         24K     http
  13.9.4.3   11.6.8.2      15         20K     http
  15.2.2.9   17.1.2.1      19         40K     http
  12.4.3.8   14.8.7.4      26         58K     http
  10.5.1.3   13.0.0.1      27         100K    ftp
  11.1.0.6   10.3.4.5      32         300K    ftp
  19.7.1.2   16.5.5.8      18         80K     ftp

AT&T collects 100 GBs of NetFlow data each day!

Network Data Processing
- Traffic estimation/analysis:
  - List the top 100 IP addresses in terms of traffic.
  - What is the average duration of an IP session?
- Fraud detection:
  - Identify all sessions whose duration was more than twice the normal.
- Security/denial of service:
  - List all IP addresses that have witnessed a sudden spike in traffic.
  - Identify IP addresses involved in more than 1000 sessions.
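Some of these questions can be answered in a single pass with constant state, which is exactly the flavor of computation this lecture is after. As a small illustration (my own sketch, not from the original slides; the record layout mirrors the IP session table above), the average session duration can be maintained with just two numbers:

```python
def running_avg_duration(sessions):
    """Yield the running average Duration after each record, using O(1) state."""
    total, count = 0, 0
    for src, dst, duration, size, proto in sessions:
        total += duration
        count += 1
        yield total / count

# First three rows of the IP session table above:
sample = [("10.1.0.2", "16.2.3.7", 12, "20K", "http"),
          ("18.6.7.1", "12.4.0.3", 16, "24K", "http"),
          ("13.9.4.3", "11.6.8.2", 15, "20K", "http")]
print(list(running_avg_duration(sample)))   # [12.0, 14.0, 14.33...]
```

The top-100 and spike-detection questions are much harder: answering them exactly needs state proportional to the number of distinct addresses, which is what motivates the approximate, small-space techniques discussed below.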
Challenges in Network Apps.
- Consider a single link at 2 Gb/s with an average packet size of 50 bytes:
  - Number of packets per second = 5 million; time per packet = 0.2 µsec.
  - Capturing packet headers (source/destination IP, time, number of bytes, etc.) takes at least 10 bytes per packet.
  - That is about 50 MB of space per second, or roughly 4.5 TB per day, per link.
  - ISPs have hundreds of links.

Data Streaming Models
- Input data: a1, a2, a3, …
- The input stream describes a signal A[i], a one-dimensional function (value vs. index).
- The mapping from the input stream to the signal is what defines the data stream model; the three standard choices follow.

Time-Series Model
- Each item a_i is the signal value A[i] itself; items arrive in increasing order of the index i.

Cash-Register Model
- Items are increments to entries of A: a_i = (j, I_i) with I_i >= 0, and A_i[j] = A_{i-1}[j] + I_i.

Turnstile Model
- Items are updates to entries of A: a_i = (j, U_i), and A_i[j] = A_{i-1}[j] + U_i (U_i may be negative).
- Strict turnstile model: A_i[j] >= 0 for all i.

Data Stream Algorithms
- Compute various functions on the signal A at various times.
- Performance measures:
  - Processing time per item a_i in the stream.
  - Space used to store the data structure on A_t at time t.
  - Time needed to compute the functions on A.

Outline
- Introduction & Motivation
- Issues & Techniques of Processing Data Streams
  - Sampling
  - Histogram
  - Wavelet
- Data Streaming Systems
- Example Algorithms for Frequency Counting
  - Lossy Counting
  - Sticky Sampling

Data Stream Algorithms
- Stream processing requirements:
  - Single pass: each record is examined at most once.
  - Bounded storage: limited memory for storing a synopsis.
  - Real time: per-record processing time (to maintain the synopsis) must be low.
- Generally, algorithms compute approximate answers; it is difficult to compute answers accurately with limited memory.

Approximation in Data Streams
- Approximate answers with deterministic bounds: the algorithm computes only an approximate answer, but with guaranteed bounds on the error.
- Approximate answers with probabilistic bounds: the algorithm computes an approximate answer with high probability, e.g. with probability at least 1 - δ the computed answer is within a factor ε of the actual answer.

Sliding Window Approximation
- Example bit stream (only the most recent window is kept): 011000011100000101010
- Why?
  - An approximation technique for bounded memory.
  - Natural in applications (emphasizes recent data).
  - Well-specified and deterministic semantics.
- Issues:
  - Extending relational algebra, SQL and query optimization.
  - Algorithmic work.
  - Timestamps?

Timestamps
- Explicit:
  - Injected by the data source.
  - Models the real-world event represented by the tuple.
  - Tuples may be out of order, but if nearly ordered they can be reordered with small buffers.
- Implicit:
  - Introduced as a special field by the DSMS.
  - Arrival time in the system.
  - Enables order-based querying and sliding windows.
- Issues: distributed streams? composite tuples created by the DSMS?

Time
- Easiest: a global system clock; stream elements and relation updates are timestamped on entry to the system.
- Application-defined time:
  - Streams and relation updates carry application timestamps and may be out of order.
  - The application generates a "heartbeat", or the heartbeat is deduced from parameters such as stream skew, scrambling, latency and clock progress.
  - Query results are in application time.
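To make the sliding-window semantics above concrete, here is a minimal sketch in Python (my own illustration; the function name and window parameter are not from the slides) of a time-based window aggregate over timestamped tuples:

```python
from collections import deque

def sliding_window_sum(stream, window):
    """Running sum of the values whose timestamp lies within the last `window` time units.

    `stream` yields (timestamp, value) pairs in non-decreasing timestamp order.
    """
    buf = deque()       # (timestamp, value) pairs currently inside the window
    total = 0
    for ts, value in stream:
        buf.append((ts, value))
        total += value
        while buf and buf[0][0] <= ts - window:   # expire tuples older than the window
            _, old = buf.popleft()
            total -= old
        yield ts, total

# Count the 1-bits among the last 5 time units of the bit stream shown above
bits = [(t, int(b)) for t, b in enumerate("011000011100000101010")]
print(list(sliding_window_sum(bits, 5))[-1])   # (20, 2): two 1s at times 16..20
```

Note that this buffers every tuple inside the window, so its space grows with the window contents; the algorithmic work mentioned above is largely about maintaining such aggregates approximately in far less space.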
Sampling: Basics
- A small random sample S of the data often represents all the data well.
- Example: select agg from R where R.e is odd (n = 12)
  - Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
  - Sample S: 9 5 1 8
  - If agg is avg, return the average of the odd elements in S. Answer: 5.
  - If agg is count, return the average over all elements e in S of (n if e is odd, 0 if e is even). Answer: 12 * 3/4 = 9. Unbiased!

Histograms
- Histograms approximate the frequency distribution of element values in a stream.
- A histogram (typically) consists of:
  - A partitioning of the element domain values into buckets.
  - A count C_B per bucket B (the number of elements in B).
- Long history of use for selectivity estimation within a query optimizer ([Koo80], [PSC84], etc.).

Types of Histograms
- Equi-depth histograms: select the buckets so that the count per bucket is the same.
  [Figure: counts per bucket over domain values 1-20, with bucket boundaries chosen so that each bucket holds an equal total count.]
- V-optimal histograms [IP95] [JKM98]: select the buckets to minimize the frequency variance within buckets, i.e. minimize Σ_B Σ_{v in B} (f_v - C_B/V_B)^2, where f_v is the frequency of value v, C_B the total count of bucket B and V_B the number of values in B.
  [Figure: counts per bucket over domain values 1-20, with bucket boundaries placed where the frequency distribution changes.]

Answering Queries using Histograms [IP99]
- (Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation.
- Example: select count(*) from R where 4 <= R.e <= 15
  [Figure: the range 4 <= R.e <= 15 over an equi-depth histogram on domain values 1-20.]
  - The count is spread evenly among the values of a bucket; in the example the range covers 3.5 buckets, so the answer is 3.5 * C_B.
  - For equi-depth histograms, the maximum error is 2 * C_B.

Wavelet Basics
- Wavelets give a hierarchical decomposition of functions/signals.
- Haar wavelets: the simplest wavelet basis; recursive pairwise averaging and differencing at different resolutions.

  Resolution   Averages                     Detail coefficients
  3            [2, 2, 0, 2, 3, 5, 4, 4]     -
  2            [2, 1, 4, 4]                 [0, -1, -1, 0]
  1            [1.5, 4]                     [0.5, 0]
  0            [2.75]                       [-1.25]

- Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]

Haar Wavelet Coefficients
- Hierarchical decomposition structure ("error tree"): the root holds the overall average 2.75, its child the coefficient -1.25, the next level holds 0.5 and 0, and the lowest level holds 0, -1, -1, 0, sitting above the original frequency distribution [2, 2, 0, 2, 3, 5, 4, 4].
- Each coefficient has a "support": the range of original values it contributes to, with a + sign on one half and a - sign on the other.
[Figure: the error tree for the example signal together with the supports of the individual coefficients.]

Wavelet-based Histograms [MVW98]
- Problem: range-query selectivity estimation.
- Key idea: use a compact subset of the Haar wavelet coefficients to approximate the frequency distribution.
- Steps:
  - Compute the cumulative frequency distribution C.
  - Compute the Haar wavelet transform of C.
  - Coefficient thresholding: only m << n coefficients can be kept.

Using Wavelet-based Histograms
- Selectivity estimation: count(a <= R.e <= b) = C'[b] - C'[a-1], where C' is the (approximate) "reconstructed" cumulative distribution.
- Time: O(min{m, log N}), where m is the size of the wavelet synopsis (number of coefficients) and N the size of the domain; at most log N + 1 coefficients are needed to reconstruct any value C'[a].
- Empirical results over synthetic data show improvements over random sampling and plain histograms.
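As a concrete check of the averaging-and-differencing example above, here is a minimal Python sketch (my own illustration, not from the original slides) that reproduces the decomposition of [2, 2, 0, 2, 3, 5, 4, 4]:

```python
def haar_decompose(signal):
    """Unnormalized Haar decomposition by recursive pairwise averaging and differencing.

    Returns [overall average] + detail coefficients from coarsest to finest.
    Assumes len(signal) is a power of two.
    """
    coeffs = []
    values = list(signal)
    while len(values) > 1:
        averages = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
        details = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
        coeffs = details + coeffs      # finer details go to the right
        values = averages
    return values + coeffs

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Coefficient thresholding, as in the wavelet-based histograms of [MVW98] above, would keep only a small subset of these coefficients and treat the rest as zero.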
Data Streaming Systems
- Low-level, application-specific approach.
- DBMS approach.
- Generic data stream management systems.

DBMS Vs. DSMS: Meta-Questions
- Killer apps and motivation:
  - Do application stream rates exceed DBMS capacity? Can a DSMS handle high rates anyway?
  - Is there a need for a general-purpose DSMS, rather than ad-hoc, application-specific systems?
- Is a non-trivial DSMS merely a DBMS with enhanced support for triggers, temporal constructs and data rate management?

DBMS versus DSMS

  DBMS                                          DSMS
  Persistent relations                          Transient streams (and persistent relations)
  One-time queries                              Continuous queries
  Random access                                 Sequential access
  Access plan determined by query processor     Unpredictable data characteristics
  and physical DB design                        and arrival patterns

(Simplified) Big Picture of DSMS
[Diagram: input streams enter the DSMS, users register queries, and the DSMS produces stored and streamed results, backed by an archive, a scratch store and stored relations.]

(Simplified) Network Monitoring
[Diagram: network measurements and packet traces feed the DSMS; registered monitoring queries produce intrusion warnings and online performance metrics, using an archive, a scratch store and lookup tables.]

Using Conventional DBMS
- Treat data streams as relation inserts and continuous queries as triggers or materialized views.
- Problems with this approach:
  - Inserts are typically batched: high overhead.
  - Expressiveness: only simple conditions (triggers), no built-in notion of sequence (views).
  - No notion of approximation or resource allocation.
  - Current systems do not scale to a large number of triggers.
  - Views do not provide streamed results.

Query 1 (self-join)
Find all outgoing calls longer than 2 minutes:

  SELECT O1.call_ID, O1.caller
  FROM   Outgoing O1, Outgoing O2
  WHERE  O2.time - O1.time > 2
    AND  O1.call_ID = O2.call_ID
    AND  O1.event = 'start'
    AND  O2.event = 'end'

- The result requires unbounded storage.
- Can provide the result as a data stream.
- Can output a call after 2 minutes, without seeing its end.

Query 2 (join)
Pair up callers and callees:

  SELECT O.caller, I.callee
  FROM   Outgoing O, Incoming I
  WHERE  O.call_ID = I.call_ID

- Can still provide the result as a data stream.
- Requires unbounded temporary storage … unless the streams are near-synchronized.

Query 3 (group-by aggregation)
Total connection time for each caller:

  SELECT   O1.caller, sum(O2.time - O1.time)
  FROM     Outgoing O1, Outgoing O2
  WHERE    O1.call_ID = O2.call_ID
    AND    O1.event = 'start'
    AND    O2.event = 'end'
  GROUP BY O1.caller

- Cannot provide the result as an (append-only) stream. Output updates? Provide the current value on demand?

Data Model
- Append-only: call records.
- Updates: stock tickers.
- Deletes: transactional data.
- Meta-data: control signals, punctuations.
- System internals probably need all of the above.

Related Database Technology
- A DSMS must use ideas from all of the following, but none is a substitute:
  - Triggers and materialized views in conventional DBMSs
  - Main-memory databases
  - Sequence/temporal/time-series databases
  - Real-time databases
  - Adaptive, online, partial results
- Novelty in DSMS:
  - Semantics: input ordering, streaming output, …
  - State: cannot store unending streams, yet need history.
  - Performance: rate, variability, imprecision, …

Outline
- Introduction & Motivation
- Data Stream Management Systems
- Issues & Techniques of Processing Data Streams
  - Sampling
  - Histogram
  - Wavelet
- Example Algorithms for Frequency Counting
  - Lossy Counting
  - Sticky Sampling

Problem of Frequency Counts
- Given a stream, identify all elements whose current frequency exceeds a support threshold, e.g. s = 0.1%.

Algorithm 1: Lossy Counting
- Step 1: Divide the stream into "windows" (Window 1, Window 2, Window 3, …).
- Is the window size a function of the support s? Will fix later…
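The next slides step through this window by window; as a compact preview, here is a minimal Python sketch of the whole procedure (my own illustration, not the authors' code). It uses the window size of 1/ε that the error analysis below justifies and reports the elements whose counter exceeds (s - ε)N:

```python
def lossy_counting(stream, s, epsilon):
    """Sketch of Lossy Counting: count within windows of size 1/epsilon,
    decrement every counter by 1 at each window boundary and drop zero
    counters, then report items whose counter exceeds (s - epsilon) * N."""
    window = int(1 / epsilon)
    counts = {}
    n = 0                              # current length of the stream
    for item in stream:
        n += 1
        counts[item] = counts.get(item, 0) + 1
        if n % window == 0:            # window boundary: decrement and prune
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return {x: c for x, c in counts.items() if c > (s - epsilon) * n}

# 'a' makes up 30% of a 1000-item stream; with s = 20% only 'a' is reported
stream = ['a', 'b', 'c'] * 100 + ['a'] * 200 + list(range(500))
print(lossy_counting(stream, s=0.2, epsilon=0.02))   # {'a': 280}
```

The decrement-and-prune step is what keeps the number of counters small; the analysis below bounds it by (1/ε) log(εN) in the worst case.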
Lossy Counting in Action
[Figure: the frequency counts start out empty and accumulate the items of the first window.]
- At the window boundary, decrement all counters by 1 (counters that drop to 0 are discarded).

Lossy Counting (cont'd)
[Figure: the items of the next window are added to the surviving frequency counts.]
- Again, at the window boundary, decrement all counters by 1.

Error Analysis
- How much do we undercount? If the current size of the stream is N and the window size is 1/ε, then the frequency error is at most the number of windows, i.e. εN.
- Rule of thumb: set ε to 10% of the support s.
  - Example: given support frequency s = 1%, set error frequency ε = 0.1%.

Analysis of Lossy Counting
- Output: elements with counter values exceeding sN - εN.
- Approximation guarantees:
  - Frequencies are underestimated by at most εN.
  - No false negatives.
  - False positives have true frequency at least sN - εN.
- How many counters do we need? Worst case: (1/ε) log(εN) counters.

Algorithm 2: Sticky Sampling
- Stream: 28 31 41 23 35 19 34 15 30 …
- Create counters by sampling; maintain exact counts thereafter.
- At what rate should we sample?

Sticky Sampling (cont'd)
- For a finite stream of length N: sampling rate = (2/(Nε)) log(1/(sδ)), where δ is the probability of failure.
- Output: elements with counter values exceeding sN - εN.
- Same error guarantees as Lossy Counting, but probabilistic!
- Same rule of thumb: set ε to 10% of the support s.
  - Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%.

Sampling Rate?
- Finite stream of length N: sampling rate (2/(Nε)) log(1/(sδ)).
- Infinite stream with unknown N: gradually adjust the sampling rate.
- In either case, the expected number of counters is (2/ε) log(1/(sδ)), independent of N!

New Directions
- Functional approximation theory
- Data structures
- Computational geometry
- Graph theory
- Databases
- Hardware
- Streaming models
- Data stream quality monitoring

References (1)
[AGM99] N. Alon, P.B. Gibbons, Y. Matias, M. Szegedy. Tracking Join and Self-Join Sizes in Limited Storage. ACM PODS, 1999.
[AMS96] N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the frequency moments. ACM STOC, 1996.
[CIK02] G. Cormode, P. Indyk, N. Koudas, S. Muthukrishnan. Fast mining of tabular data via approximate distance computations. IEEE ICDE, 2002.
[CMN98] S. Chaudhuri, R. Motwani, and V. Narasayya. Random Sampling for Histogram Construction: How much is enough? ACM SIGMOD, 1998.
[CDI02] G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan. Comparing Data Streams Using Hamming Norms. VLDB, 2002.
[DGG02] A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries over Data Streams. ACM SIGMOD, 2002.
[DJM02] T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining database structure or how to build a data quality browser. ACM SIGMOD, 2002.
[DH00] P. Domingos and G. Hulten. Mining high-speed data streams. ACM SIGKDD, 2000.
[EKSWX98] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental Clustering for Mining in a Data Warehousing Environment. VLDB, 1998.
[FKS99] J. Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan. An approximate L1-difference algorithm for massive data streams. IEEE FOCS, 1999.
[FM85] P. Flajolet, G.N. Martin. Probabilistic Counting Algorithms for Data Base Applications. JCSS 31(2), 1985.

References (2)
[Gib01] P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. VLDB, 2001.
[GGI02] A.C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss. Fast, small-space algorithms for approximate histogram maintenance. ACM STOC, 2002.
[GGRL99] J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT-Optimistic Decision Tree Construction. ACM SIGMOD, 1999.
[GK01] M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. ACM SIGMOD, 2001.
[GKM01] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. Surfing Wavelets on Streams: One Pass Summaries for Approximate Aggregate Queries. VLDB, 2001.
[GKM02] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. How to Summarize the Universe: Dynamic Maintenance of Quantiles. VLDB, 2002.
[GKS01b] S. Guha, N. Koudas, and K. Shim. Data Streams and Histograms. ACM STOC, 2001.
[GM98] P.B. Gibbons and Y. Matias. New Sampling-Based Summary Statistics for Improving Approximate Query Answers. ACM SIGMOD, 1998.
[GMP97] P.B. Gibbons, Y. Matias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. VLDB, 1997.
[GT01] P.B. Gibbons, S. Tirthapura. Estimating Simple Functions on the Union of Data Streams. ACM SPAA, 2001.

References (3)
[HHW97] J.M. Hellerstein, P.J. Haas, and H.J. Wang. Online Aggregation. ACM SIGMOD, 1997.
[HSD01] G. Hulten, L. Spencer, and P. Domingos. Mining Time-Changing Data Streams. ACM SIGKDD, 2001.
[IKM00] P. Indyk, N. Koudas, S. Muthukrishnan. Identifying representative trends in massive time series data sets using sketches. VLDB, 2000.
[Ind00] P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings, and Data Stream Computation. IEEE FOCS, 2000.
[IP95] Y. Ioannidis and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. ACM SIGMOD, 1995.
[IP99] Y.E. Ioannidis and V. Poosala. Histogram-Based Approximation of Set-Valued Query Answers. VLDB, 1999.
[JKM98] H.V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel. Optimal Histograms with Quality Guarantees. VLDB, 1998.
[JL84] W.B. Johnson, J. Lindenstrauss. Extensions of Lipschitz Mappings into Hilbert Space. Contemporary Mathematics, 26, 1984.
[Koo80] R.P. Kooi. The Optimization of Queries in Relational Databases. PhD thesis, Case Western Reserve University, 1980.

References (4)
[MRL98] G.S. Manku, S. Rajagopalan, and B.G. Lindsay. Approximate Medians and other Quantiles in One Pass and with Limited Memory. ACM SIGMOD, 1998.
[MRL99] G.S. Manku, S. Rajagopalan, B.G. Lindsay. Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. ACM SIGMOD, 1999.
[MVW98] Y. Matias, J.S. Vitter, and M. Wang. Wavelet-based Histograms for Selectivity Estimation. ACM SIGMOD, 1998.
[MVW00] Y. Matias, J.S. Vitter, and M. Wang. Dynamic Maintenance of Wavelet-based Histograms. VLDB, 2000.
[PIH96] V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. ACM SIGMOD, 1996.
[PJO99] F. Provost, D. Jenson, and T. Oates. Efficient Progressive Sampling. KDD, 1999.
[Poo97] V. Poosala. Histogram-Based Estimation Techniques in Database Systems. PhD thesis, Univ. of Wisconsin, 1997.
[PSC84] G. Piatetsky-Shapiro and C. Connell. Accurate Estimation of the Number of Tuples Satisfying a Condition. ACM SIGMOD, 1984.
[SDS96] E.J. Stollnitz, T.D. DeRose, and D.H. Salesin. Wavelets for Computer Graphics. Morgan Kaufmann Publishers Inc., 1996.
References (5)
[T96] H. Toivonen. Sampling Large Databases for Association Rules. VLDB, 1996.
[TGI02] N. Thaper, S. Guha, P. Indyk, N. Koudas. Dynamic Multidimensional Histograms. ACM SIGMOD, 2002.
[U89] P.E. Utgoff. Incremental Induction of Decision Trees. Machine Learning, 4, 1989.
[U94] P.E. Utgoff. An Improved Algorithm for Incremental Induction of Decision Trees. ICML, 1994.
[Vit85] J.S. Vitter. Random Sampling with a Reservoir. ACM TOMS, 1985.