Randomization for Massive and Streaming Data Sets
Rajeev Motwani
May 21, 2003
CS Forum Annual Meeting

Data Stream Management Systems
- Traditional DBMS – data stored in finite, persistent data sets
- Data streams – distributed, continuous, unbounded, rapid, time-varying, noisy, …
- Emerging DSMSs serve a variety of modern applications: network monitoring and traffic engineering, telecom call records, network security, financial applications, sensor networks, manufacturing processes, web logs and clickstreams, massive data sets

DSMS – Big Picture
[Figure: input streams flow into the DSMS, which evaluates registered queries and produces streamed or stored results; supporting components are an archive, a scratch store, and stored relations.]

Algorithmic Issues
- Computational model: streaming data (or secondary memory), bounded main memory
- Techniques: new paradigms, negative results and approximation, randomization
- Complexity measures: memory, time per item (online, real-time), number of passes (linear scans in secondary memory)

Stream Model of Computation
- Main memory holds only synopsis data structures over the data stream
- Memory: poly(1/ε, log N)
- Query/update time: poly(1/ε, log N)
- N: number of items so far, or the window size; ε: error parameter

"Toy" Example – Network Monitoring
[Figure: monitoring queries registered with the DSMS run over network measurements and packet traces, producing intrusion warnings and online performance metrics; backed by an archive, a scratch store, and lookup tables.]

Frequency Related Problems
Analytics on packet headers (IP addresses):
- Top-k most frequent elements
- Find all elements with frequency > 0.1%
- Find elements that occupy 0.1% of the tail
- What is the frequency of element 3?
- What is the total frequency of elements between 8 and 14?
- How many elements have non-zero frequency?
- Mean + variance? Median?

Example 1 – Distinct Values
- Input sequence X = x1, x2, …, xn, …
- Domain U = {0, 1, 2, …, u−1}
- Compute D(X), the number of distinct values
Remarks:
- Assume the stream size n is finite/known (in general, n is the window size)
- The domain could be arbitrary (e.g., text, tuples)

Naïve Approach
- Keep a counter C(i) for each domain value i
- Initialize counters C(i) ← 0
- Scan X, incrementing the appropriate counters
Problem:
- Memory size M << n
- Space is O(u), and possibly u >> n (e.g., when counting distinct words in a web crawl)

Negative Result
Theorem: Deterministic algorithms need M = Ω(n log u) bits.
Proof: information-theoretic arguments.
Note: this leaves open randomization and approximation.

Randomized Algorithm
- Hash the input stream into a hash table via h: U → [1..t]
- Analysis: a random h gives few collisions and average list size O(n/t)
- Thus space is O(n) – since we need t = Ω(n) – and expected time is O(1) per item

Improvement via Sampling?
- Sample-based estimation: take a random sample R (of size r) of the n values in X, compute D(R), and use the estimator E = D(R) · n/r
- Benefit – sublinear space
- Cost – the estimation error is high
- Why? – low-frequency values are underrepresented in the sample

Negative Result for Sampling
Consider an estimator E of D(X) that examines r items of X, possibly in an adaptive/randomized fashion.
Theorem: For any δ > e^(−r), E has relative error at least √((n−r)/(2r) · ln(1/δ)) with probability at least δ.
Remarks:
- For r = n/10: error ≈ 75% with probability ½
- Leaves open randomization/approximation on full scans

Randomized Approximation
- Simplified problem – for a fixed t, is D(X) >> t?
- Choose a hash function h: U → [1..t]
- Initialize the answer to NO
- For each xi, if h(xi) = t, set the answer to YES
- Observe – only 1 bit of memory (a Boolean YES/NO flag) is needed!
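A minimal Python sketch of this one-bit test, using a salted built-in hash as a stand-in for a random hash function h: U → [1..t]; the function name, seeding, and example streams are illustrative, not from the talk.

```python
import random

def one_bit_distinct_test(stream, t, seed=0):
    """One-bit test for 'is D(X) large relative to t?'.

    Hash each item into buckets 1..t and raise a flag if any item lands
    in bucket t.  Only the flag -- a single bit -- is kept across the scan.
    """
    rng = random.Random(seed)
    salt = rng.getrandbits(64)               # fixes one (pseudo-)random hash h: U -> [1..t]
    flag = False                             # answer is NO until some item hits bucket t
    for x in stream:
        bucket = (hash((salt, x)) % t) + 1   # h(x) in {1, ..., t}
        if bucket == t:
            flag = True                      # some distinct value hashed to t: answer YES
    return "YES" if flag else "NO"

# Illustrative use: a stream with few distinct values tends to answer NO,
# one with many distinct values almost surely answers YES.
print(one_bit_distinct_test([1, 2, 3] * 1000, t=100))
print(one_bit_distinct_test(range(10_000), t=100))
```

The quality of this single test is what the following theorem quantifies.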
Theorem: If D(X) < t, then P[output NO] > 0.25; if D(X) > 2t, then P[output NO] < 0.14.

Analysis
- Let Y be the set of distinct elements of X
- Output is NO ⟺ no element of Y hashes to t
- P[an element hashes to t] = 1/t
- Thus P[output NO] = (1 − 1/t)^|Y|
- Since |Y| = D(X):
  D(X) < t  ⇒ P[output NO] > (1 − 1/t)^t > 0.25
  D(X) > 2t ⇒ P[output NO] < (1 − 1/t)^(2t) < 1/e² ≈ 0.14

Boosting Accuracy
- With 1 bit we can distinguish D(X) < t from D(X) > 2t
- Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0
- Running O(log n) instances in parallel for t = 1, 2, 4, 8, …, n estimates D(X) within a factor of 2
- The choice of multiplier 2 is arbitrary – using a factor (1+ε) reduces the error to ε
Theorem: D(X) can be estimated within a factor of (1 ± ε) with probability (1 − δ) using space O((1/ε²) · log n · log(1/δ)).

Example 2 – Elephants-and-Ants
- Given a stream, identify items whose current frequency exceeds a support threshold s = 0.1% [Jacobson 2000, Estan-Verghese 2001]

Algorithm 1: Lossy Counting
- Step 1: divide the stream into windows (Window 1, Window 2, Window 3, …)
- The window size W is a function of the support s – specified later

Lossy Counting in Action
- Start with empty frequency counts; while processing a window, increment the counter of each arriving element (creating it if necessary)
- At the window boundary, decrement all counters by 1 and discard counters that reach zero

Lossy Counting continued
- Process the next window in the same way, merging its counts into the frequency counts
- Again, at the window boundary, decrement all counters by 1

Error Analysis
- How much do we undercount? If the current stream size is N and the window size is W = 1/ε, then the frequency error is at most the number of windows = εN
- Rule of thumb: set ε to 10% of the support s
- Example: given support frequency s = 1%, set the error frequency ε = 0.1%

Putting it all together…
- Output: elements with counter values exceeding (s − ε)N
- Approximation guarantees: frequencies are underestimated by at most εN; no false negatives; false positives have true frequency at least (s − ε)N
- How many counters do we need? Worst-case bound: (1/ε) · log(εN) counters
- Implementation details…

Algorithm 2: Sticky Sampling
- Create counters by sampling the stream; maintain exact counts thereafter
- What should the sampling rate be?

Sticky Sampling contd.
- For a finite stream of length N, sampling rate = (2/εN) · log(1/(sδ)), where δ is the probability of failure
- Output: elements with counter values exceeding (s − ε)N
- Approximation guarantees (probabilistic): frequencies are underestimated by at most εN; no false negatives; false positives have true frequency at least (s − ε)N
- Same rule of thumb: set ε to 10% of the support s (same error guarantees as Lossy Counting, but probabilistic)
- Example: given support threshold s = 1%, set the error threshold ε = 0.1% and the failure probability δ = 0.01%

Number of counters?
- Finite stream of length N: sampling rate (2/εN) · log(1/(sδ))
- Infinite stream with unknown N: gradually adjust the sampling rate
- In either case, the expected number of counters is (2/ε) · log(1/(sδ)) – independent of N

Example 3 – Correlated Attributes
- Input stream – items with boolean attributes
- Matrix – M(r,c) = 1 ⟺ row r has attribute c
- Identify – highly correlated column pairs

       C1 C2 C3 C4 C5
  R1    1  1  1  1  0
  R2    1  1  0  1  0
  R3    1  0  0  1  0
  R4    0  0  1  0  1
  R5    1  1  1  0  1
  R6    1  1  1  1  1
  R7    0  1  1  1  1
  R8    0  1  1  1  0
  …

Correlation ⇒ Similarity
- View each column as the set of row indexes where it has 1's
- Set similarity (Jaccard measure): sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|
- Example:
    Ci = 0 1 1 0 1 0
    Cj = 1 0 1 0 1 1
  sim(Ci, Cj) = 2/5 = 0.4
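To make the set view concrete, here is a small Python sketch (the function name and the zero-column convention are my own, not from the talk) that computes the Jaccard similarity of two boolean columns; on the example columns above it returns 0.4.

```python
def jaccard(ci, cj):
    """Jaccard similarity of two boolean columns, viewed as sets of row indexes."""
    rows_i = {r for r, bit in enumerate(ci) if bit}   # rows where Ci has a 1
    rows_j = {r for r, bit in enumerate(cj) if bit}   # rows where Cj has a 1
    if not (rows_i | rows_j):
        return 0.0                                    # convention for two all-zero columns
    return len(rows_i & rows_j) / len(rows_i | rows_j)

# The example columns from the slide:
ci = [0, 1, 1, 0, 1, 0]
cj = [1, 0, 1, 0, 1, 1]
print(jaccard(ci, cj))   # 0.4
```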
Identifying Similar Columns?
- Goal – find candidate pairs in small memory
- Signature idea: hash each column Ci to a small signature sig(Ci) so that the set of signatures fits in memory and sim(Ci, Cj) is approximated by sim(sig(Ci), sig(Cj))
- Naïve approach: sample P rows uniformly at random and define sig(Ci) as the P bits of Ci in the sample
- Problem: sparsity – the sample would pick up only 0's in the columns and miss their interesting part

Key Observation
For columns Ci and Cj, there are four types of rows:

  Type  Ci  Cj
   A     1   1
   B     1   0
   C     0   1
   D     0   0

Overloading notation, let A = # rows of type A (similarly B, C, D).
Observation: sim(Ci, Cj) = A / (A + B + C)

Min Hashing
- Randomly permute the rows
- Hash h(Ci) = index of the first row (in the permuted order) with a 1 in column Ci
- Surprising property: P[h(Ci) = h(Cj)] = sim(Ci, Cj)
- Why? Both equal A/(A + B + C): look down columns Ci and Cj until the first non-type-D row; h(Ci) = h(Cj) exactly when that row is of type A

Min-Hash Signatures
- Pick k random row permutations
- Min-hash signature sig(C) = the k indexes of the first rows with a 1 in column C
- Similarity of signatures: define sim(sig(Ci), sig(Cj)) = fraction of permutations on which the min-hash values agree
- Lemma: E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)

Example

       C1 C2 C3
  R1    1  0  1
  R2    0  1  1
  R3    1  0  0
  R4    1  0  1
  R5    0  1  0

  Signatures            S1  S2  S3
  Perm 1 = (12345)       1   2   1
  Perm 2 = (54321)       4   5   4
  Perm 3 = (34512)       3   5   4

  Similarities   1-2   1-3   2-3
  Col-Col        0.00  0.50  0.25
  Sig-Sig        0.00  0.67  0.00

Implementation Trick
- Permuting the rows even once is prohibitive
- Row hashing: pick k hash functions h_k: {1, …, n} → {1, …, O(n)}; ordering the rows under h_k gives a random row permutation
- One-pass implementation

Comparing Signatures
- Signature matrix S: rows = hash functions, columns = columns, entries = signatures
- Need – pairwise similarity of the signature columns
- Problem: MinHash fits the column signatures in memory, but comparing all signature pairs takes too much time
- Limiting the candidate pairs – Locality-Sensitive Hashing

Summary
- New algorithmic paradigms are needed for streams and massive data sets
- Negative results abound
- Need to approximate
- Power of randomization

Thank You!

References
- Rajeev Motwani (http://theory.stanford.edu/~rajeev)
- STREAM Project (http://www-db.stanford.edu/stream)
- STREAM: The Stanford Stream Data Manager. Bulletin of the Technical Committee on Data Engineering, 2003.
- Motwani et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. CIDR 2003.
- Babcock, Babu, Datar, Motwani, Widom. Models and Issues in Data Stream Systems. PODS 2002.
- Manku, Motwani. Approximate Frequency Counts over Data Streams. VLDB 2002.
- Babcock, Datar, Motwani, O'Callaghan. Maintaining Variance and k-Medians over Data Stream Windows. PODS 2003.
- Guha, Meyerson, Mishra, Motwani, O'Callaghan. Clustering Data Streams: Theory and Practice. IEEE TKDE 2003.

References (contd)
- Datar, Gionis, Indyk, Motwani. Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing, 2002.
- Babcock, Datar, Motwani. Sampling from a Moving Window over Streaming Data. SODA 2002.
- O'Callaghan, Guha, Mishra, Meyerson, Motwani. High-Performance Clustering of Streams and Large Data Sets. ICDE 2003.
- Guha, Mishra, Motwani, O'Callaghan. Clustering Data Streams. FOCS 2000.
- Cohen et al. Finding Interesting Associations without Support Pruning. ICDE 2000.
- Charikar, Chaudhuri, Motwani, Narasayya. Towards Estimation Error Guarantees for Distinct Values. PODS 2000.
- Gionis, Indyk, Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.
- Indyk, Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.