Efficient Computation of Frequent and Top-k Elements in Data Streams

Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi
Department of Computer Science
University of California, Santa Barbara

Motivation
Motivated by Internet advertising commissioners.
Before rendering an advertisement for a user, query the clicks stream for advertisements to display.
If the user's profile is not a frequent “clicker”, then s/he will probably not click any displayed advertisement.
– Show Pay-Per-Impression advertisements.
If the user's profile is a frequent “clicker”, then s/he may click a displayed advertisement.
– Show Pay-Per-Click advertisements.
– Retrieve top advertisements to choose what to display.

Problem Definition
Given an alphabet A and a stream S of size N:
– A frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN
– The top-k elements are the k elements with the highest frequency
Both problems:
– Are closely related, though no integrated solution has been proposed
– An exact solution requires O(min(N, |A|)) space, hence the approximate variations

Practical Frequent Elements
ε-Deficient Frequent Elements [Manku ‘02]:
– All frequent elements output should have F > (φ - ε)N, where ε is the user-defined error.
[Figure: element frequencies with the thresholds φN and (φ - ε)N marked]

Practical Top-k
FindApproxTop(S, k, ε) [Charikar ‘02]:
– Retrieve a list of k elements such that every element, E_i, in the list has F_i > (1 - ε) F_k, where E_k is the kth ranked element.
[Figure: element frequencies with the thresholds F_4 and (1 - ε) F_4 marked]

Related Work
Algorithms Classification
– Counter-Based techniques
  • Keep an individual counter for each monitored element
  • If the observed ID is monitored, its counter is updated
  • If the observed ID is not monitored, an algorithm-dependent action is taken
– Sketch-Based techniques
  • Estimate the frequency of all elements using bit-maps of counters
  • Each element is hashed into the counters’ space using a family of hash functions
  • The hashed-to counters are queried for the frequencies

Recent Work (Comparison)
Algorithm | Nature | Space Bound | Handles
CountSketch [Charikar ‘02] | Sketch | O(k/ε^2 log(N/δ)), δ is the failure probability | FindApproxTop(S, k, ε)
GroupTest [Cormode ’03] | Sketch | O(φ^-1 log(φ^-1) log(|A|)) | Hot Items
Frequent [Demaine ’02] | Counter | O(1/ε), proved by [Bose ‘03] | FE
Probabilistic-Inplace [Demaine ’02] | Counter | O(m), m is the available memory | FindCandidateTop(S, k, m/2)
Lossy Counting [Manku ’02] | Counter | (1/ε) log(εN) | ε-Deficient FE
Sticky Sampling [Manku ’02] | Counter | (2/ε) log(φ^-1 δ^-1) | ε-Deficient FE

Outline
– Problem Definition
– Space-Saving: Summarizing the Data Stream
– Answering Frequent Elements Queries
– Answering Top-k Queries
– Experimental Results
– Conclusion

The Space-Saving Algorithm
– Space-Saving is counter-based
– Monitor only m elements
– Only over-estimation errors
– Frequency estimation is more accurate for significant elements
– Keep track of max. possible errors

Space-Saving By Example
[Animated example: a Stream-Summary with three counters (Element, Count, error = max possible over-estimation) processing the stream S = ABBACABBDDBEC]
Space-Saving Algorithm
– For every element in the stream S
– If a monitored element is observed
  • Increment its Count
– If a non-monitored element is observed
  • Replace the element with minimum hits, min
  • Increment the minimum Count to min + 1
  • Record min as the new element's error, its maximum possible over-estimation

Space-Saving Observations
S = ABBACABBDDBEC, N = 13
Element | Count | error (max possible)
B | 5 | 0
E | 4 | 3
C | 4 | 3
Observations:
– The summation of the Counts is N
– Minimum number of hits, min ≤ N/m
– In this example, min = 4
– The minimum number of hits, min, is an upper bound on the error of any element

Space-Saving Proved Properties
S = ABBACABBDDBEC, N = 13 (same Stream-Summary as on the previous slide)
1. If element E has frequency F > min, then E must be in Stream-Summary.
   Example: F(B) = F_1 = 5, min = 4.
2. The Count at position i in Stream-Summary is no less than F_i, the frequency of the ith ranked element.
   Example: F(A) = F_2 = 3, Count_2 = 4.

Space-Saving Data Structure
We need a data structure that
– Increments counters in constant time
– Keeps elements sorted by their counters
We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02]

Frequent Elements Queries
– Traverse Stream-Summary, and report all elements that satisfy the user support
– Any element whose guaranteed hits = (Count – error) > φN is guaranteed to be a frequent element

Frequent Elements Example
Element | Count | error | Guaranteed Hits = Count - error
B | 20 | 1 | 19
D | 14 | 0 | 14
G | 12 | 4 | 8
A | 9 | 1 | 8
Q | 7 | 3 | 4
F | 5 | 0 | 5
C | 3 | 1 | 2
E | 3 | 2 | 1
For N = 73, m = 8, φ = 0.15:
– φN = 10.95, so frequent elements need a support of at least 11 hits.
– Candidate frequent elements are B, D, and G, since their Counts exceed φN.
– Guaranteed frequent elements are B and D, since their guaranteed hits exceed φN.

Frequent Elements Space Bounds
Algorithm | General Distribution | Zipf(α)
Space-Saving | O(1/ε) | (1/ε)^(1/α)
GroupTest | O(φ^-1 log(φ^-1) log(|A|)) | not given
Frequent | O(1/ε), proved by [Bose ’03] | not given
Lossy Counting | (1/ε) log(εN) | not given
Sticky Sampling | (2/ε) log(φ^-1 δ^-1) | not given

Top-k Elements Queries
Traverse the Stream-Summary, and report the top-k elements.
From Property 2, we assert:
– Guaranteed top-k elements:
  • Any element whose guaranteed hits = (Count – error) ≥ Count_(k+1) is guaranteed to be in the top-k.
– Guaranteed top-k’ (where k’ ≈ k):
  • The top-k’ elements reported are guaranteed to be the correct top-k’ iff for every element in the top-k’, guaranteed hits = (Count – error) ≥ Count_(k’+1).

Top-k Elements Example
(Same Stream-Summary as in the Frequent Elements Example.)
For k = 3, m = 8:
– B, D, and G are the top-3 candidates.
– B and D are guaranteed to be in the top-3, since their guaranteed hits are at least Count_4 = 9.
– B, D, G, and A are guaranteed to be the top-4, since their guaranteed hits are all at least Count_5 = 7. Here k’ = 4.
– B and D are guaranteed to be the top-2, since their guaranteed hits are at least Count_3 = 12.
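The update rule and the two guarantee checks described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Entry`, `space_saving`, `frequent_elements`, and `guaranteed_in_top_k` are hypothetical, and the minimum is found with a linear scan over a dictionary instead of the constant-time Stream-Summary structure. The `example` values are the Counts and errors of the eight-element example, as best they can be read from the slide.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    count: int  # estimated hits (may over-estimate, never under-estimates)
    error: int  # maximum possible over-estimation


def space_saving(stream, m):
    """Summarize `stream` using at most m monitored elements."""
    monitored = {}
    for element in stream:
        if element in monitored:
            monitored[element].count += 1
        elif len(monitored) < m:
            monitored[element] = Entry(count=1, error=0)
        else:
            # Replace the element with minimum hits; the newcomer gets
            # min + 1 as its Count and min as its error.
            victim = min(monitored, key=lambda e: monitored[e].count)
            min_count = monitored.pop(victim).count
            monitored[element] = Entry(count=min_count + 1, error=min_count)
    return monitored


def frequent_elements(summary, phi, n):
    """Return (candidates, guaranteed) for the support phi * n."""
    support = phi * n
    candidates = [e for e, c in summary.items() if c.count > support]
    guaranteed = [e for e in candidates
                  if summary[e].count - summary[e].error > support]
    return candidates, guaranteed


def guaranteed_in_top_k(summary, k):
    """Top-k candidates whose guaranteed hits are >= Count_(k+1)."""
    ranked = sorted(summary.items(), key=lambda kv: -kv[1].count)
    next_count = ranked[k][1].count if k < len(ranked) else 0
    return [e for e, c in ranked[:k] if c.count - c.error >= next_count]


# Stream from the Space-Saving example, with m = 3 counters.
summary = space_saving("ABBACABBDDBEC", 3)

# Assumed values: the (Count, error) pairs of the eight-element example.
example = {
    "B": Entry(20, 1), "D": Entry(14, 0), "G": Entry(12, 4), "A": Entry(9, 1),
    "Q": Entry(7, 3), "F": Entry(5, 0), "C": Entry(3, 1), "E": Entry(3, 2),
}
candidates, guaranteed = frequent_elements(example, phi=0.15, n=73)
```

On the short stream, `summary` ends with B at Count 5 and error 0, and E and C each at Count 4 and error 3. On `example`, the candidates are B, D, and G, of which B and D are guaranteed, and `guaranteed_in_top_k(example, 4)` returns all of B, D, G, and A. The linear scan keeps the sketch short at the cost of O(m) work per replacement.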
Note: in the top-k example, k’ = 2 is another valid choice of k’.

Top-k Elements Space Bounds
Algorithm | General Distribution | Zipf(α)
Space-Saving | FindApproxTop(S, k, ε): O(k/ε * log(N)) | Exact Top-k Problem: α = 1: O(k^2 log(|A|)); α > 1: O((k/α)^(1/α) k)
CountSketch | FindApproxTop(S, k, ε): O(k/ε^2 * log(N/δ)) | FindApproxTop(S, k, ε): α ≥ 1: O(k * log(N/δ))

Experimental Results - Setup
Synthetic data:
– Zipf(α), α varied: 0.0, 0.5, 1.0, …, 2.5, 3.0
– N = 10^7 hits.
Real Data (ValueClick, Inc.): Similar results
Precision:
– number of correct elements found / size of the entire output
Recall:
– number of correct elements found / number of actual correct elements
Run time:
– Processing Stream + Query Time
Space used:
– Including hash table

Frequent Elements Results
Query: φ = 10^-2, ε = 10^-4, and δ = 10^-2
We compared with
– GroupTest and Frequent
All algorithms had a recall of 1.
– That is, each algorithm's output included all the correct elements.
Space-Saving was able to guarantee all its output to be correct.

Frequent Elements Precision
[Chart: Precision for Frequent Elements (>100,000 Hits) on Synthetic Data. Series: Space-Saving, GroupTest, Frequent. x-axis: Zipf Alpha (0 to 3); y-axis: Precision (0 to 1).]

Frequent Elements Run Time
[Chart: Run Time for Frequent Elements (>100,000 Hits) on Synthetic Data. Series: Space-Saving, GroupTest, Frequent. x-axis: Zipf Alpha (0 to 3); y-axis: Run Time (ms).]

Frequent Elements Space Used
[Chart: Space Used for Frequent Elements (>100,000 Hits) on Synthetic Data. Series: Space-Saving, GroupTest, Frequent. x-axis: Zipf Alpha (0 to 3); y-axis: Space Used (Bytes).]

Top-k Elements Results
Query: k = 100, ε = 10^-4, and δ = 10^-2
We compared with
– CountSketch: CountSketch was re-run several times. The hidden constant was estimated to be 16, in order to have output of competitive quality.
– Probabilistic-InPlace: was allowed the same number of counters as Space-Saving
Space-Saving was able to guarantee all its output to be correct.

Top-k Elements Precision
[Chart: Precision for Top-100 on Synthetic Data. Series: Space-Saving, CountSketch, Probabilistic InPlace. x-axis: Zipf Alpha (0 to 3); y-axis: Precision (0 to 1).]

Top-k Elements Recall
[Chart: Recall for Top-100 on Synthetic Data. Series: Space-Saving, CountSketch, Probabilistic InPlace. x-axis: Zipf Alpha (0 to 3); y-axis: Recall (0 to 1).]

Top-k Elements Run Time
[Chart: Run Time for Top-100 on Synthetic Data. Series: Space-Saving, CountSketch, Probabilistic InPlace. x-axis: Zipf Alpha (0 to 3); y-axis: Run Time (ms).]

Top-k Elements Space Used
[Chart: Space Used for Top-100 on Synthetic Data. Series: Space-Saving, CountSketch, Probabilistic InPlace. x-axis: Zipf Alpha (0 to 3); y-axis: Space Used (Bytes).]

Conclusion
Contributions:
– An integrated approach to solve an interesting family of problems
– Strict error bounds using little space
– Guarantees on results
– Special attention was given to Zipfian data
– Experimental validation
Future Work:
– Incremental frequent and top-k elements reporting