Streaming Algorithms
CS6234 Advanced Algorithms
February 10, 2015

The stream model
• Data enters sequentially at a rapid rate from one or more inputs
• We cannot store the entire stream
• Processing happens in real time
• Limited memory (usually sublinear in the size of the stream)
• Goal: compute a function of the stream, e.g., median, number of distinct elements, longest increasing sequence
• An approximate answer is usually preferable

Overview
• Counting bits with DGIM algorithm
• Bloom Filter
• Count-Min Sketch
• Approximate Heavy Hitters
• AMS Sketch
• AMS Sketch Applications

Counting bits with DGIM algorithm
Presented by Dmitrii Kharkovskii

Sliding windows
• A useful model: queries are about a window of length N
  – The N most recent elements received (or the last N time units)
• Interesting case: N is still so large that the window cannot be stored
  – Or, there are so many streams that windows for all of them cannot be stored

Problem description
• Problem
  – Given a stream of 0's and 1's
  – Answer queries of the form "how many 1's are in the last k bits?" where k ≤ N
• Obvious solution
  – Store the most recent N bits (i.e., window size = N)
  – When a new bit arrives, discard the (N+1)st bit
• Real problem
  – Slow: need to scan k bits to count
  – What if we cannot afford to store N bits?
  – Estimate with an approximate answer

Datar-Gionis-Indyk-Motwani Algorithm (DGIM)
Overview
• Approximate answer
• Uses $O(\log^2 N)$ bits of memory
• Performance guarantee: error no more than 50%
• Possible to decrease the error to any fraction $\varepsilon > 0$ with $O(\log^2 N)$ memory
• Possible to generalize to the case of a stream of positive integers

Main idea of the algorithm
Represent the window as a set of exponentially growing non-overlapping buckets

Timestamps
• Each bit in the stream has a timestamp: its position in the stream, counted from the beginning.
• Record timestamps modulo N (the window size), so each timestamp uses $O(\log N)$ bits
• Store the most recent timestamp to identify the position of any other bit in the window

Buckets
• Each bucket has two components:
  – Timestamp of its most recent end. Needs $O(\log N)$ bits
  – Size of the bucket: the number of ones in it
  – Size is always $2^j$
  – To store $j$ we need $O(\log \log N)$ bits
• Each bucket needs $O(\log N)$ bits

Representing the stream by buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• Buckets do not overlap.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).

Updating buckets when a new bit arrives
• Drop the oldest bucket if it no longer overlaps the window
• If the current bit is zero, no other changes are needed
• If the current bit is one
  – Create a new bucket for it: size = 1, timestamp = current time modulo N
  – If there are 3 buckets of size 1, merge the two oldest into one of size 2
  – If there are 3 buckets of size 2, merge the two oldest into one of size 4
  – ...

Example of updating process
[Figure: example of the updating process]

Query Answering
How many ones are in the most recent k bits?
• Find all buckets overlapping the last k bits
• Sum the sizes of all but the oldest one
• Add half of the size of the oldest one
• Example: Ans = 1 + 1 + 2 + 4 + 4 + 8 + 8/2 = 24

Memory requirements
• There are at most two buckets of each size $2^j$, so there are $O(\log N)$ buckets
• Each bucket needs $O(\log N)$ bits
• Total: $O(\log^2 N)$ bits

Performance guarantee
• Suppose the oldest bucket overlapping the query has size $2^r$.
• By taking half of it, the maximum error is $2^{r-1}$
• There is at least one bucket of every size less than $2^r$
• The sum of the newer buckets is therefore at least $1 + 2 + 4 + \dots + 2^{r-1} = 2^r - 1$
• The right end of the oldest bucket is always a 1 inside the window, so the true count is at least $2^r$
• Error is at most 50%

References
J. Leskovec, A. Rajaraman, J. Ullman. "Mining of Massive Datasets". Cambridge University Press

Bloom Filter
Presented by Naheed Anjum Arafat

Motivation: The "Set Membership" Problem
• x: an element
• S: a finite set of elements
• Input: x, S
• Output:
  – True (if x in S)
  – False (if x not in S)
Solution: binary search on a sorted array of size |S|. Runtime complexity: $O(\log |S|)$
Streaming algorithm:
• Limited space per item
• Limited processing time per item
• Approximate answer based on a summary/sketch of the data stream in memory

Bloom Filter
• Consists of
  – a vector of n Boolean values, initially all set to false (complexity: $O(n)$)
  – k independent and uniform hash functions $h_0, h_1, \dots, h_{k-1}$, each of which outputs a value in the range {0, 1, …, n-1}
Example with n = 10:
index: 0 1 2 3 4 5 6 7 8 9
value: F F F F F F F F F F

Bloom Filter
• For each element $s \in S$, the Boolean values at positions $h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to true.
• Complexity of insertion: $O(k)$
Example with k = 3: inserting $s_1$ with $h_0(s_1) = 1$, $h_1(s_1) = 4$, $h_2(s_1) = 6$
index: 0 1 2 3 4 5 6 7 8 9
value: F T F F T F T F F F

Bloom Filter
• For each element $s \in S$, the Boolean values at positions $h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to true.
Note: a particular Boolean value may be set to true several times.
Example with k = 3: inserting $s_2$ with $h_0(s_2) = 4$, $h_1(s_2) = 7$, $h_2(s_2) = 9$
index: 0 1 2 3 4 5 6 7 8 9
value: F T F F T F T T F T

Algorithm to Approximate Set Membership Query
Input: x (may or may not be an element of S)
Output: Boolean
For all i ∈ {0, 1, …, k-1}
    if the value at position $h_i(x)$ is False
        return False
return True
Runtime complexity: $O(k)$
[Figure: the vector after inserting $s_1$ and $s_2$ (F T F F T F T T F T), queried with x = $s_1$ and x = $s_3$; k = 3]

Algorithm to Approximate Set Membership Query
False positive: an element x with $h_0(x) = 9$, $h_1(x) = 6$, $h_2(x) = 1$ finds all three positions already set by $s_1$ and $s_2$, although x was never inserted.
index: 0 1 2 3 4 5 6 7 8 9
value: F T F F T F T T F T

Error Types
• False negative: answering "is not there" for an element which "is there"
  – Never happens for a Bloom filter
• False positive: answering "is there" for an element which "is not there"
  – Might happen. How likely?
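The insertion and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the class name `BloomFilter`, the method names, and the use of salted SHA-256 digests as a stand-in for the k independent uniform hash functions are all assumptions made for the sketch.

```python
import hashlib


class BloomFilter:
    """Bloom filter with n Boolean slots and k hash functions (illustrative)."""

    def __init__(self, n: int, k: int) -> None:
        self.n = n
        self.k = k
        self.bits = [False] * n  # all slots start as False

    def _positions(self, item: str) -> list[int]:
        # Assumption: salting SHA-256 with the index i stands in for the
        # k independent, uniform hash functions h_0 .. h_{k-1}.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.n)
        return positions

    def add(self, item: str) -> None:
        # Insertion: set the k positions to True, O(k).
        for p in self._positions(item):
            self.bits[p] = True

    def might_contain(self, item: str) -> bool:
        # Query: False is always correct; True may be a false positive.
        return all(self.bits[p] for p in self._positions(item))


bf = BloomFilter(n=1000, k=3)
bf.add("s1")
bf.add("s2")
print(bf.might_contain("s1"))  # True: inserted items are never reported missing
```

A query for an item that was never added usually returns False, but it can return True when all k of its positions happen to be set by other items.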
Probability of false positives
n = size of table, m = number of items, k = number of hash functions
Consider a particular bit j, 0 ≤ j ≤ n-1.
Probability that $h_i(x)$ does not set bit j after hashing only 1 item:
$P(h_i(x) \ne j) = 1 - \frac{1}{n}$
Probability that $h_i$ does not set bit j after hashing m items:
$P(\forall x \in \{S_1, S_2, \dots, S_m\} : h_i(x) \ne j) = \left(1 - \frac{1}{n}\right)^m$

Probability of false positives
Probability that none of the hash functions sets bit j after hashing m items:
$P(\forall x \in \{S_1, S_2, \dots, S_m\},\ \forall i \in \{1, 2, \dots, k\} : h_i(x) \ne j) = \left(1 - \frac{1}{n}\right)^{km}$
We know that $\left(1 - \frac{1}{n}\right)^n \approx \frac{1}{e} = e^{-1}$, so
$\left(1 - \frac{1}{n}\right)^{km} = \left(\left(1 - \frac{1}{n}\right)^n\right)^{km/n} \approx \left(e^{-1}\right)^{km/n} = e^{-km/n}$

Probability of false positives
Approximate probability of a false positive:
Probability that bit j is not set: $P(\text{bit } j = F) = e^{-km/n}$
The probability of having all k bits of a new element already set: $\left(1 - e^{-km/n}\right)^k$
For fixed m and n, which value of k will minimize this bound?
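One possible route to the answer, sketched under the approximation above and treating $k$ as a continuous variable (an assumption of this sketch, not something stated on the slides), is to substitute $q = e^{-km/n}$ and minimize the logarithm of the bound:

```latex
\begin{align*}
p(k) &= \left(1 - e^{-km/n}\right)^{k},
  & \ln p(k) &= k \,\ln\!\left(1 - e^{-km/n}\right) \\
q &= e^{-km/n} \;\Rightarrow\; k = -\frac{n}{m}\ln q,
  & \ln p &= -\frac{n}{m}\,\ln q \,\ln(1 - q)
\end{align*}
```

The product $\ln q \,\ln(1 - q)$ is symmetric under $q \leftrightarrow 1 - q$ and attains its maximum at $q = 1/2$, so the bound is smallest when $e^{-km/n} = 1/2$, that is, when about half of the bits in the vector are set.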
$k_{opt} = \ln 2 \cdot \frac{n}{m}$
The probability of a false positive is then $\left(\frac{1}{2}\right)^{k_{opt}} \approx (0.6185)^{n/m}$, where $n/m$ is the number of bits per item.

Bloom Filters: cons
• Small false positive probability
• Cannot handle deletions
• Size of the bit vector has to be set a priori in order to maintain a predetermined false-positive rate
  – Resolved in "Scalable Bloom Filter": Almeida, Paulo; Baquero, Carlos; Preguica, Nuno; Hutchison, David (2007), "Scalable Bloom Filters", Information Processing Letters 101 (6): 255–261

References
• https://en.wikipedia.org/wiki/Bloom_filter
• Graham Cormode, Sketch Techniques for Approximate Query Processing, AT&T Research
• Michael Mitzenmacher, Compressed Bloom Filters, Harvard University, Cambridge

Count-Min Sketch
Erick Purwanto A0050717L

Motivation Count-Min Sketch
• Implemented in real systems
  – AT&T: network switch to analyze network traffic using limited memory
  – Google: implemented on top of MapReduce parallel processing infrastructure
• Simple and used to solve other problems
  – Heavy Hitters by Joseph
  – Second Moment $F_2$, AMS Sketch by Manupa
  – Inner Product, Self Join by Sapumal

Frequency Query
• Given a stream of data, a vector $x$ of length $n$ with $x_i \in [1, m]$, and an update (increment) operation,
  – we want to know at each time the frequency $f_j$ of item $j$
  – assume frequency $f_j \ge 0$
• Trivial if we have a count array [1, $m$]
  – we want sublinear space
  – probabilistically approximately correct

Count-Min Sketch
• Assumption:
  – family of $d$-independent hash functions $H$
  – sample $d$ hash functions $h_i \leftarrow H$, with $h_i : [1, m] \to [1, w]$
• Use: $d$ independent hash functions
and an integer array CM[$w$, $d$] ($d$ rows of $w$ counters each)

Count-Min Sketch
• Algorithm to update:
  – Inc($j$): for each row $i$, CM[$i$, $h_i(j)$] += 1

Count-Min Sketch
• Algorithm to estimate a frequency query:
  – Count($j$): $\hat{f}_j = \min_i \mathrm{CM}[i, h_i(j)]$

Collision
• Entry CM[$i$, $h_i(j)$] is an estimate of the frequency of item $j$ at row $i$
  – for example, with $h_1(5) = h_1(2) = 7$ and the stream $x$ = 3 5 5 8 5 2 5, entry 7 of row 1 counts both item 5 and item 2
• Let $f_j$ be the frequency of $j$, and let the random variable $X_{i,j}$ be the total frequency of all $k \ne j$ with $h_i(k) = h_i(j)$

Count-Min Sketch Analysis
• Estimated frequency of $j$ at row $i$:
  $\hat{f}_{i,j} = \mathrm{CM}[i, h_i(j)] = f_j + \sum_{k \ne j,\ h_i(k) = h_i(j)} f_k = f_j + X_{i,j}$

Count-Min Sketch Analysis
• Let $\varepsilon$ be the approximation error, and set $w = e / \varepsilon$
• The expectation of the other items' contribution, with $F_1 = \sum_k f_k$:
  $E[X_{i,j}] = \sum_{k \ne j} f_k \cdot \Pr[h_i(k) = h_i(j)] \le \Pr[h_i(k) = h_i(j)] \cdot \sum_k f_k = \frac{1}{w} \cdot F_1 = \frac{\varepsilon}{e} \cdot F_1$

Count-Min Sketch Analysis
• Markov inequality: $\Pr[X \ge k \cdot E[X]] \le \frac{1}{k}$
• Probability that a row estimate is $\varepsilon \cdot F_1$ far from the true value:
  $\Pr[\hat{f}_{i,j} > f_j + \varepsilon \cdot F_1] = \Pr[X_{i,j} > \varepsilon \cdot F_1] \le \Pr[X_{i,j} > e \cdot E[X_{i,j}]] \le \frac{1}{e}$

Count-Min Sketch Analysis
• Let $\delta$ be the failure probability, and set $d = \ln(1/\delta)$
• Probability that the final estimate is far from the true value:
  $\Pr[\hat{f}_j > f_j + \varepsilon \cdot F_1] = \Pr[\forall i : \hat{f}_{i,j} > f_j + \varepsilon \cdot F_1] = \left(\Pr[\hat{f}_{i,j} > f_j + \varepsilon \cdot F_1]\right)^d \le \left(\frac{1}{e}\right)^{\ln(1/\delta)} = \delta$

Count-Min Sketch
• Result
  – dynamic data structure CM, item frequency query
  – set $w = e / \varepsilon$ and $d = \ln(1/\delta)$
  – with probability at least $1 - \delta$, $\hat{f}_j \le f_j + \varepsilon \cdot \sum_k f_k$
  – sublinear space, does not depend on $n$ nor $m$
  – running time: update $O(d)$ and frequency query $O(d)$

Approximate Heavy Hitters
TaeHoon Joseph, Kim

Count-Min Sketch (CMS)
• Inc($j$) takes $O(d)$ time
  – $O(1 \times d)$: update $d$ values
• Count($j$) takes $O(d)$ time
  – $O(1 \times d)$: return the minimum of $d$ values

Heavy Hitters Problem
• Input:
  – An array of length $n$ with $m$ distinct items
• Objective:
  – Find all items that occur more than $n/k$ times in the array
  – there can be at most $k$ such items
• Parameter
  – $k$

Heavy Hitters Problem: Naïve Solution
• Trivial solution is to use an $O(m)$ array
1. Store all items and each item's frequency
2. Find all (at most $k$) items that have frequency $\ge n/k$

$\epsilon$-Heavy Hitters Problem ($\epsilon$-HH)
• Relaxes the Heavy Hitters Problem
• Requires sublinear space
  – cannot solve the exact problem
  – parameters: $k$ and $\epsilon$

$\epsilon$-Heavy Hitters Problem ($\epsilon$-HH)
1. Returns every item that occurs more than $n/k$ times
2. Returns some items that occur more than $n/k - \epsilon \cdot n$ times
  – Count-Min Sketch: $\hat{f}_j \le f_j + \varepsilon \cdot \sum_k f_k$

Naïve Solution using CMS
[Figure: each of the $m$ items is hashed by $h_1, h_2, \dots, h_d$ into the $d \times w$ sketch]

Naïve Solution using CMS
• Query the frequency of all $m$ items
  – Return items with Count($j$) $\ge n/k$
• $O(md)$: slow

Better Solution
• Use CMS to store the frequencies
• Use a baseline $b$ as a threshold at the $i$-th item
  – $b = i/k$
• Use a MinHeap to store potential heavy hitters at the $i$-th item
  – store new items in the MinHeap with frequency $\ge b$
  – delete old items from the MinHeap with frequency $< b$

$\epsilon$-Heavy Hitters Problem ($\epsilon$-HH)
1. Returns every item that occurs more than $n/k$ times
2. Returns some items that occur more than $n/k - \epsilon \cdot n$ times
  – if $\epsilon = \frac{1}{2k}$, then Count($x$) $\in [f_x, f_x + \frac{n}{2k}]$
  – heap size = $2k$

Algorithm Approximate Heavy Hitters
Input: stream $x$, parameter $k$
For each item $j \in x$:
1. Update the Count-Min Sketch
2. Compare the frequency of $j$ with $b$
3. If count $\ge b$, insert or update $j$ in the MinHeap
4. Remove any value in the MinHeap with frequency $< b$
Return the MinHeap as the heavy hitters

Examples ($k = 5$; heap entries are written {count:item})
• $i = 1$: item 4 arrives, $b = i/k = 1/5$. The counters for item 4 become 1 and {1:4} is inserted into the MinHeap.
• $i = 5$: the stream so far is 4 2 6 9 3, $b = 5/5$. The MinHeap holds {1:4}, {1:2}, {1:6}, {1:9}, {1:3}.
• $i = 6$: item 4 arrives again, $b = 6/5$. The counters for item 4 become 2, the entry is updated to {2:4}, and all entries with count 1 are removed.
• $i = 79$: item 2 arrives, $b = 79/5 = 15.8$. The MinHeap holds {16:4}, {20:9}, {23:6}. The counters for item 2 go from 16, 18, 15 to 17, 19, 16, so Count(2) = 16 $\ge b$ and {16:2} is inserted.
• $i = 80$: item 1 arrives, $b = 80/5 = 16$. The counters shown for item 1 (3, 6, 4) are below $b$, so the MinHeap is unchanged.
• $i = 81$: item 9 arrives, $b = 81/5 = 16.2$. The counters for item 9 go from 20, 24, 25 to 21, 25, 26, so its entry becomes {21:9}; {16:2} and {16:4} fall below $b$ and are removed, leaving {21:9} and {23:6}.

Analysis
• Because $n$ is unknown, possible heavy hitters are recalculated and stored every time a new item comes in
• Maintaining the heap requires extra $O(\log k) = O(\log \frac{1}{\varepsilon})$ time

AMS Sketch: Estimate Second Moment
Dissanayaka Mudiyanselage Emil Manupa Karunaratne

The Second Moment
• Stream: a sequence of items, where item $i$ occurs $f_i$ times
• The second moment: $F_2 = \sum_i f_i^2$
• The trivial solution would be: maintain a histogram of size n and compute the sum of squares
• It is not feasible to maintain such a large array, therefore we intend to find an approximation algorithm that achieves sublinear space complexity with bounded errors
• The
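algorithm will give an estimate within $\epsilon$ relative error with $\delta$ failure probability (two parameters).

The Method
[Figure: item $j$ is hashed to one bucket in each of the $d$ rows; row $k$ adds $g_k(j)$ to that bucket]
• $j$ is the next item in the stream.
• $d$ 2-wise independent hash functions $h_k$ find the bucket for each row
• After finding the bucket, $d$ 4-wise independent hash functions $g_k$ decide whether to increment or decrement: $g_k(j) \in \{-1, +1\}$
• In summary: for each row $k$, CM[$k$, $h_k(j)$] += $g_k(j)$

The Method
• Calculate the row estimate: $R_k = \sum_{b=1}^{w} \mathrm{CM}[k, b]^2$
• Median: the final estimate is the median of $R_1, \dots, R_d$
• Choose $w = \frac{4}{\epsilon^2}$ and $d = 8 \log(\frac{1}{\delta})$; by doing so it will give an estimate with $\epsilon$ relative error and $\delta$ failure probability

Why should this method give $F_2$?
• For the $k$-th row: $\mathrm{CM}[k, b] = \sum_{j : h_k(j) = b} g_k(j) \, f_j$
• Estimate $F_2$ from the $k$-th row: $R_k = \sum_{b=1}^{w} \mathrm{CM}[k, b]^2$
• For each row this expands to: $R_k = \sum_j g_k(j)^2 f_j^2 + \sum_{i \ne j,\ h_k(i) = h_k(j)} g_k(i) \, g_k(j) \, f_i f_j$
• First part: $g_k(j)^2 = 1$, so it equals $\sum_j f_j^2 = F_2$
• Second part: $g(i) g(j)$ can be +1 or -1 with equal probability, therefore the expectation is 0.

What guarantee can we give about the accuracy?
• The variance of $R_k$, a row estimate, is caused by hashing collisions.
• Given the independent nature of the hash functions, we can safely state that the variance is bounded by $O(F_2^2 / w)$.
• Using the Chebyshev inequality, $\Pr[|R_k - F_2| \ge \epsilon F_2] \le \frac{\mathrm{Var}[R_k]}{\epsilon^2 F_2^2}$
• Let's assign $w = \frac{4}{\epsilon^2}$
• Still, the failure probability is only linear in $\frac{1}{w}$.

What guarantee can we give about the accuracy?
• We had $d$ hash functions, which produce the estimates $R_1, R_2, \dots, R_d$.
• The median being wrong ⇒ half of the estimates are wrong
• These are $d$ independent estimates, like coin tosses, which have an exponentially decaying probability of all giving the same outcome.
• They have stronger bounds, Chernoff bounds:
  – $\mu = d$ (number of estimates) $\times \frac{3}{4}$ (success probability)
  – an error needs at most $\frac{d}{2}$ correct estimates, which is $\frac{d}{4}$ away from the mean; the probability of such a deviation decays exponentially in $d$

Space and Time Complexity
• E.g., in order to achieve a failure probability of $e^{-10}$, only 8 × 10 = 80 rows are required
• Space complexity is $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$ counters ($w \cdot d$).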
• Time complexity will be explained later along with the application

AMS Sketch and Applications
Sapumal Ahangama

Hash functions
• $h_k$ maps the input domain uniformly to the buckets {1, 2, …, $w$}
• $h_k$ should be a pairwise independent hash function, to cancel out product terms
  – Ex: family of $h(x) = ((ax + b) \bmod p) \bmod w$
  – for $a$ and $b$ chosen from the prime field $p$, $a \ne 0$

Hash functions
• $g_k$ maps elements from the domain uniformly onto {−1, +1}
• $g_k$ should be four-wise independent
• Ex: family of $g(x) = (ax^3 + bx^2 + cx + d) \bmod p$ equations
• $g(x) = 2 \left( \left( (ax^3 + bx^2 + cx + d) \bmod p \right) \bmod 2 \right) - 1$
  – for $a, b, c, d$ chosen uniformly from the prime field $p$

Hash functions
• These hash functions can be computed very quickly, faster even than more familiar (cryptographic) hash functions
• For scenarios which require very high throughput, efficient implementations are available for hash functions,
  – based on optimizations for particular values of $p$, and partial precomputations
  – Ref: M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with applications to second moment estimation. In ACM-SIAM Symposium on Discrete Algorithms, 2004

Time complexity - Update
• The sketch is initialized by picking the hash functions to use, and initializing the array of counters to all zeros
• For each update operation, the item is mapped to an entry in each row based on the hash functions $h_j$, and the update is multiplied by the corresponding value of $g_j$
• Processing each update therefore takes time $O(d)$
  – since each hash function evaluation takes constant time

Time complexity - Query
• The estimate is found by taking the sum of the squares of each row of the sketch in turn, and then finding the median of these sums
  – That is, for each row $k$, compute $\sum_i \mathrm{CM}[k, i]^2$
  – Take the median of the $d$ such estimates
• Hence the query time is linear in the size of the sketch, $O(wd)$

Applications - Inner product
• The AMS sketch can be used to estimate the inner product between a pair of vectors
• Given two frequency distributions $f$ and $f'$:
  $f \cdot f' = \sum_{i=1}^{M} f(i) \cdot f'(i)$
• The AMS sketch based estimator is an unbiased estimator for the inner product of the vectors

Inner Product
• Two sketches CM and CM′
• Formed with the same parameters and using the same hash functions (same $w$, $d$, $h_k$, $g_k$)
• The row estimate is the inner product of the rows: $\sum_{i=1}^{w} \mathrm{CM}[k, i] \cdot \mathrm{CM}'[k, i]$

Inner Product
• Expanding $\sum_{i=1}^{w} \mathrm{CM}[k, i] \cdot \mathrm{CM}'[k, i]$
• shows that the estimate gives $f \cdot f'$ with additional cross terms due to collisions of items under $h_k$
• The expectation of these cross terms is zero
  – over the choice of the hash functions, as the function $g_k$ is equally likely to add as to subtract any given term

Inner Product – Join size estimation
• The inner product has a natural interpretation as the size of the equi-join between two relations
• In SQL: SELECT COUNT(*) FROM D, D' WHERE D.id = D'.id

Example (d = 3, w = 8)
Initially all counters are 0.

UPDATE(23, 1) with $h_1 = 3, g_1 = -1$; $h_2 = 1, g_2 = -1$; $h_3 = 7, g_3 = +1$
        1   2   3   4   5   6   7   8
row 1   0   0  -1   0   0   0   0   0
row 2  -1   0   0   0   0   0   0   0
row 3   0   0   0   0   0   0  +1   0

UPDATE(99, 2) with $h_1 = 5, g_1 = +1$; $h_2 = 1, g_2 = -1$; $h_3 = 3, g_3 = +1$
        1   2   3   4   5   6   7   8
row 1   0   0  -1   0  +2   0   0   0
row 2  -3   0   0   0   0   0   0   0
row 3   0   0  +2   0   0   0  +1   0
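The update and query steps of the AMS sketch, including the example above, can be sketched in Python as follows. This is an illustrative sketch rather than the presenters' code: the class name `AMSSketch`, the choice of prime `P`, the seeded random coefficients, and 0-based bucket indices (the slides count from 1) are all assumptions. The hash families follow the polynomial forms given on the hash function slides.

```python
import random
import statistics

P = 2_147_483_647  # assumed prime (2^31 - 1); item ids should be smaller than P


class AMSSketch:
    """AMS sketch with d rows of w counters (buckets are 0-indexed here)."""

    def __init__(self, w: int, d: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w = w
        self.d = d
        self.rows = [[0] * w for _ in range(d)]
        # h_k(x) = ((a*x + b) mod p) mod w, with a != 0
        self.h_coef = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
        # g_k(x) = 2 * (((a*x^3 + b*x^2 + c*x + e) mod p) mod 2) - 1
        self.g_coef = [tuple(rng.randrange(P) for _ in range(4)) for _ in range(d)]

    def _h(self, k: int, x: int) -> int:
        a, b = self.h_coef[k]
        return ((a * x + b) % P) % self.w

    def _g(self, k: int, x: int) -> int:
        a, b, c, e = self.g_coef[k]
        return 2 * (((a * x**3 + b * x**2 + c * x + e) % P) % 2) - 1

    def update(self, item: int, count: int = 1) -> None:
        # O(d): one signed counter update per row.
        for k in range(self.d):
            self.rows[k][self._h(k, item)] += self._g(k, item) * count

    def estimate_f2(self) -> float:
        # O(w*d): median over rows of the sum of squared counters.
        return statistics.median(sum(c * c for c in row) for row in self.rows)


# The final table of the slide example, entered by hand.
example = AMSSketch(w=8, d=3)
example.rows = [
    [0, 0, -1, 0, 2, 0, 0, 0],
    [-3, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 2, 0, 0, 0, 1, 0],
]
print(example.estimate_f2())  # 5: the row sums of squares are 5, 9, 5
```

In the slide example the true second moment is $1^2 + 2^2 = 5$. Row 2 overestimates it because both items collide in bucket 1 with the same sign, and the median discards that row.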
