Counting Distinct Objects over Sliding Windows
Presented by: Muhammad Aamir Cheema
Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin
University of New South Wales, Australia

Introduction
• Counting distinct objects: given a dataset D, return the number of distinct objects in D.
• Counting distinct objects against sliding windows: given a data stream, return the number of distinct objects that arrive at or after timestamp t.
• Applications: traffic management, call centers, wireless communication, stock market, etc.

Introduction
Approximate counting: let n be the actual number of distinct objects and n' be the reported answer. Build a sketch such that every query is answered with the following guarantee:
  |n - n'|/n ≤ ε with confidence (1 - δ)
Contribution:
• FM-based algorithms
  • SE-FM (accuracy guarantee + space usage guarantee)
  • PCSA-based algorithm (no accuracy guarantee, although it works well in practice; more efficient)
• k-Skyband (accuracy guarantee + efficient + no space usage guarantee)

FM Algorithm
FM sketch:
• Let h(x) be a uniform hash function
• Let the "pivot" p(x) be the position of the leftmost 1-bit of h(x)
• Let FM be an array of size k initialized to zero
• For each record x in the dataset: FM[p(x)] = 1
• Let B = FMmin be the position of the leftmost 0-bit of FM
• Number of distinct elements = α · 2^B where α = 1.2897385
Each bit i of h(x) has probability 1/2 of being one.
Example (k = 4), stream: r1 r2 r1 r3 r1
  h(r1) = 1 0 1 0
  h(r2) = 0 0 1 0
  h(r3) = 1 1 0 1
  FM    = 1 0 1 0, so FMmin = 1
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS 1985

FM Algorithm
• Each bit i of h(x) has probability 1/2 of being one
• An h(x) whose first i bits are zero and whose (i+1)-th bit is one has probability 1/2^(i+1)
• Let n be the number of distinct elements
  • FM[0] is accessed approx. n/2 times
  • FM[1] is accessed approx. n/4 times
  • ...
  • FM[i] is accessed approx. n/2^(i+1) times
• If i >> log2 n, FM[i] will almost certainly be zero
• If i << log2 n, FM[i] will almost certainly be one
• If i ≈ log2 n, FM[i] may be zero or one
Hence, the first i for which FM[i] is zero can be used to approximate the number of distinct elements n.

FM Algorithm
Use r hash functions to create r FM sketches:
• Initialize each FM_i to zero
• For each record x in the dataset
  • For each hash function h_i(x): FM_i[p_i(x)] = 1
• Let B_i be the position of the leftmost 0-bit of FM_i
• B = (B_1 + B_2 + ... + B_r)/r
• Number of distinct elements = α · 2^B where α = 1.2897385
Example (r = 3): FM_1 = 1 0 1 0 gives B_1 = 1; FM_2 begins 1 1 and gives B_2 = 2; FM_3 = 1 1 0 1 gives B_3 = 2; hence B = (1 + 2 + 2)/3 ≈ 1.67
Performance guarantee: let n be the actual number of distinct objects, n' be the reported answer and m be the domain of elements; then
  P( |n' - n|/n ≤ ε ) ≥ 1 - δ
if n > 1/ε, k = O(log m + log(1/ε) + log(1/δ)) and r = O((1/ε^2) log(1/δ))

FM-based Algorithm
Maintaining one FM sketch:
• For each record (x, t) in the dataset: FM[p(x)] = t
Answering a query:
• For any t, let B = FMmin(t) be the position of the leftmost entry of FM with value less than t
• Number of distinct elements that arrived at or after t = α · 2^B where α = 1.2897385
Example (k = 4), stream at timestamps 1 to 5: r1 r2 r3 r2 r2
  h(r1) = 1 0 1 0
  h(r2) = 0 0 1 0
  h(r3) = 1 1 0 1
  FM after timestamp 5 = 3 0 5 0, so FMmin(4) = 0

FM-based Algorithm
Maintain r FM sketches:
• Initialize each FM_i to zero
• For each record (x, t) in the dataset
  • For each hash function h_i(x): FM_i[p_i(x)] = t
Answering a query:
• For any t, let B_i(t) be the position of the leftmost entry smaller than t in the i-th FM sketch
• Let B = (B_1(t) + B_2(t) + ... + B_r(t))/r
• Number of distinct elements that arrived at or after t = α · 2^B where α = 1.2897385

Performance Analysis
Let n be the actual number of distinct objects arriving not before time t, n' be the reported answer and m be the domain of elements; then
  P( |n' - n|/n ≤ ε ) ≥ 1 - δ
if n > 1/ε, k = O(log m + log(1/ε) + log(1/δ)) and r = O((1/ε^2) log(1/δ))
• Total space: O((1/ε^2) log(1/δ) log m)
• Total maintenance cost for one record: O((1/ε^2) log(1/δ) log log m)
• Total query cost: O((1/ε^2) log(1/δ) log log m)

PCSA-based Algorithm
Maintain r FM sketches but update only j < r sketches per record:
• Generate j hash functions H(x) that map x to [1, r]
• Initialize each FM_i to zero
• For each record (x, t) in the dataset
  • For each of the j hash functions H(): i = H(x); update the i-th FM sketch
Answering a query:
• For any t, let B_i(t) be the position of the leftmost entry smaller than t in the i-th FM sketch
• Let B = (B_1(t) + B_2(t) + ... + B_r(t))/r
• Number of distinct elements that arrived at or after t = (α · 2^B)/j where α = 1.2897385
Inspired by the PCSA technique in "P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS 1985"
NOTE: No accuracy guarantee, but performs well in practice

BJKST Algorithm
Main idea:
• Let h() be a hash function that hashes D to [1, m^3] where m = |D|
• For each record x, generate its hash value h(x)
• Maintain the k-th smallest distinct hash value k_min
• Number of distinct elements = n = k · m^3 / k_min
Improved algorithm:
• Use r hash functions
• Compute n_i for each hash function h_i() as above
• Report the final answer as the median of the n_i values
Performance guarantee:
  P( |n' - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log(1/δ))
Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM'02.
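The BJKST steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (hash_to_range, kmv_estimate, bjkst_estimate), the use of seeded SHA-256 as a stand-in for r independent hash functions, and the fallback when fewer than k distinct hash values exist are all assumptions made for illustration.

```python
import hashlib
from statistics import median


def hash_to_range(x, seed, space):
    """Hypothetical helper: map x to an integer in [1, space] via seeded SHA-256."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest, "big") % space + 1


def kmv_estimate(stream, k, seed, space):
    """Estimate the distinct count from the k-th smallest distinct hash value."""
    smallest = set()  # the (at most) k smallest distinct hash values seen so far
    for x in stream:
        smallest.add(hash_to_range(x, seed, space))
        if len(smallest) > k:
            smallest.remove(max(smallest))
    if len(smallest) < k:
        # Assumption: with fewer than k distinct hash values, report their count.
        return len(smallest)
    k_min = max(smallest)  # k-th smallest distinct hash value
    return k * space / k_min  # n = k * m^3 / k_min


def bjkst_estimate(stream, k, r):
    """Median of r independent estimates, with hash range [1, m^3], m = |D|."""
    items = list(stream)
    space = len(items) ** 3
    return median(kmv_estimate(items, k, seed, space) for seed in range(r))


if __name__ == "__main__":
    data = [i % 5000 for i in range(20000)]  # 5000 distinct values
    print(bjkst_estimate(data, k=256, r=5))
```

Scanning for max(smallest) costs O(k) per update; a balanced tree or heap with a membership set would be the usual choice when update cost matters.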
K-Skyband Technique
Main idea:
• Let h() be a hash function that hashes D to [1, m^3] where m = |D|
• For each record (x, t') generate h(x) and store the record (x, h(x), t')
Answering a query q(t):
• Retrieve all records (x, h(x), t') with timestamp t' ≥ t
• Get the k-th smallest distinct hash value and apply the BJKST algorithm
Limitation: requires storing all records

K-Skyband Technique
• For any time t, we need to find the k-th smallest hash value among records arriving no earlier than t
• A record x dominates another record y if x arrives after y and has a smaller hash value
• The k-skyband keeps only the records that are dominated by at most (k-1) records
Maintaining the k-skyband:
• Keep a counter for each record
• When a new element (x, t) arrives, increment the counter of every record dominated by it
• Remove the records whose counter is at least k
• To improve efficiency, counters are incremented for groups of records (domination aggregation search tree)
[Figure: records a to e plotted by arrival time t against hash value h(x), with k = 2]

K-Skyband Technique
Answering a query: find k_min, the k-th smallest hash value among elements arriving no earlier than t
• Let z be the number of stored elements that arrived before t
• k_min is the (z+k)-th smallest hash value overall
Algorithm:
• Maintain a binary search tree eT that stores elements according to t
• Maintain a binary search tree eH that stores elements according to h(x)
• When a query q(t) arrives
  • Compute z using eT
  • Find the (z+k)-th smallest hash value overall from eH
[Figure: records a to f plotted by t against h(x); with k = 2 and z = 3, k_min is the 5th smallest h(x)]

Performance Analysis
Let n be the actual number of distinct objects arriving not before time t, n' be the reported answer and m be the domain of elements; then
  P( |n' - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log(1/δ))
• Expected total space: O((1/ε^2) log(1/δ) log n)
• Expected time complexity: O(log(1/δ) (log(1/ε) + log n))

Experiments
• Synthetic datasets following Uniform and Zipf distributions
• Real dataset: WorldCup 98 HTTP requests (20 M records)
[Figure slides: Space Efficiency; Time Efficiency (maintenance cost); Time Efficiency (query response time); Accuracy]

Thanks

P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, 2001.
  Space usage: (1/ε^2) log(1/δ) m^(1/2)
Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatio-temporal aggregation using sketches. In ICDE 2004.
  Space usage: O((N/ε^2) log(1/δ) log m)

Space Requirement (SE-FM)
To guarantee the performance we require:
  k = O(log m + log(1/ε) + log(1/δ))
  r = O((1/ε^2) log(1/δ))
• Let m > 1/ε and m > 1/δ; then k = O(log m)
• Size of one sketch is k = O(log m)
• Size of r sketches is O(r log m) = O((1/ε^2) log(1/δ) log m)
Total space: O((1/ε^2) log(1/δ) log m)

Time Complexity (SE-FM)
To guarantee the performance we require:
  k = O(log m + log(1/ε) + log(1/δ))
  r = O((1/ε^2) log(1/δ))
• The elements in a sketch are stored in a min-heap to support logarithmic search/update
• Hence, the cost of one search/update operation is O(log k) = O(log log m)
• To maintain the sketches, we update r sketches for each record x
  Total maintenance cost for one record: O(r log log m) = O((1/ε^2) log(1/δ) log log m)
• To answer a query, we search in r sketches
  Total cost: O(r log log m) = O((1/ε^2) log(1/δ) log log m)

Space Usage (K-Skyband)
Performance guarantee:
  P( |n' - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log(1/δ))
• Expected size of one k-skyband = O(k ln(n/k))
• Expected size of r k-skybands = O(r k log(n/k)) = O((1/ε^2) log(1/δ) log n)

Time Complexity (K-Skyband)
Performance guarantee:
  P( |n' - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log(1/δ))
Answering a query q(t):
• Search eT to compute z: O(log(k log n)) = O(log k + log n)
• Search eH to find the (z+k)-th element: O(log k + log n)
• We require this for all r sketches: O(r (log k + log n)) = O(log(1/δ) (log(1/ε) + log n))
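The k-skyband maintenance and query steps from the K-Skyband Technique slides can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the authors' implementation: a plain dictionary scan replaces the domination aggregation search tree and the eT/eH search trees, timestamps are assumed strictly increasing, only the latest arrival of each object is stored, and the names KSkyband, h, kth_min_hash and estimate are hypothetical.

```python
import hashlib


def h(x, space):
    """Hypothetical hash: map x to an integer in [1, space] via SHA-256."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest, "big") % space + 1


class KSkyband:
    def __init__(self, k, space):
        self.k = k
        self.space = space
        self.records = {}  # x -> [hash, latest timestamp, domination counter]

    def insert(self, x, t):
        hx = h(x, self.space)
        # A re-arrival replaces the older copy of x (assumption: keep latest only).
        prev = self.records.pop(x, None)
        prev_t = prev[1] if prev else float("-inf")
        for key in list(self.records):
            rec = self.records[key]
            # (x, t) dominates rec: larger hash, older arrival. Records older
            # than the previous copy of x were already counted against x.
            if rec[0] > hx and rec[1] > prev_t:
                rec[2] += 1
                if rec[2] >= self.k:
                    del self.records[key]
        self.records[x] = [hx, t, 0]

    def kth_min_hash(self, t):
        # z: stored records that arrived before t (the eT lookup on the slide)
        z = sum(1 for _, ts, _ in self.records.values() if ts < t)
        by_hash = sorted(rec[0] for rec in self.records.values())  # stands in for eH
        idx = z + self.k
        return by_hash[idx - 1] if idx <= len(by_hash) else None

    def estimate(self, t):
        k_min = self.kth_min_hash(t)
        if k_min is None:
            # Fewer than k distinct objects in the window: count them directly.
            return sum(1 for _, ts, _ in self.records.values() if ts >= t)
        return self.k * self.space / k_min  # BJKST estimate k * m^3 / k_min


if __name__ == "__main__":
    sky = KSkyband(k=3, space=10**12)
    for i in range(1, 201):
        sky.insert((i * i + 3 * i) % 37, i)
    print(len(sky.records), sky.estimate(150))
```

Each insert here scans all stored records, so it costs time linear in the skyband size; the slides' tree-based structures exist precisely to avoid that scan.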