Calculating Frequency Moments of a Data Stream
Asad Narayanan
COMP 5703

Outline
• What is a data stream?
• Constraints of the data stream model
• Applications of data streams
• Frequency moments
• Calculating frequency moments
• Calculating F0 using the FM-sketch
• Calculating F0 using KMV
• Complexity of calculating F0
• Calculating Fk
• Complexity of calculating Fk

Data Stream
• A sequence of voluminous data arriving at high speed
• Elements can be accessed only one at a time
• Elements arrive in arbitrary order
• The stream cannot be stored and processed later
• Example: network analysis, where a link may carry around 1 million packets per second
• Our aim is to compute statistics over this data

Formal Definition
[figure: elements a1, a2, a3, …, am arriving one at a time over time T]
• A sequence of data A = (a1, a2, a3, …, am) of length m
• ai ∈ {1, 2, 3, …, n}, where n is the number of possible distinct values
• mi = |{ j | aj = i, 1 ≤ j ≤ m }| is the number of occurrences of value i
• m and n are very large
• It is impossible to store A on a local disk

Applications
• Sensor networks
• Network monitoring systems
• Data stream mining
• Detecting credit card fraud
• Database systems

Limitations
• Recording all the data is impossible
• If we tried to record every IP address passing through a network, we would need space of the order 2^32
• The data must be processed in a single pass
• Store a summary of the data in limited space and time
• Obtain a sketch of the data which can be reused to compute statistics
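To make the notation concrete, the quantities m and mi can be computed exactly for a small in-memory stream (the stream below is an illustrative example, not taken from the slides):

```python
from collections import Counter

# Illustrative stream A = (a1, ..., am); in a real setting A is far too
# large to hold in memory and may only be read once.
A = [3, 1, 4, 1, 5, 1, 2, 3]

m = len(A)                # length of the stream
counts = Counter(A)       # counts[i] = m_i, the number of occurrences of value i
n_distinct = len(counts)  # number of distinct values actually seen

print(m, n_distinct, counts[1])  # 8 5 3
```

Of course, this exact approach is precisely what the streaming model rules out: it needs memory proportional to the number of distinct values, which is the motivation for sketching.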
Frequency Moments
• A powerful statistical tool that can be used to determine demographic information about the data
• The k-th frequency moment of a sequence A, for k ≥ 0, is defined as
  F_k(A) = Σ_{i=1}^{n} m_i^k
• F0 is the number of distinct elements in A
• F1 is the total number of elements in A
• F_k for k ≥ 2 gives an idea of the skew of the data distribution

Calculating Frequency Moments
• The direct approach requires memory of the order Ω(n) to store m_i for every distinct value i ∈ {1, 2, …, n}
• Because of our memory limitations, we need an algorithm that computes F_k in much less memory
• This can be achieved if we are ready to compromise on accuracy
• We want an algorithm that computes an (ε, δ)-approximation F'_k of F_k, i.e.
  Pr[ |F'_k − F_k| ≤ ε·F_k ] ≥ 1 − δ
• ε is the approximation parameter and δ is the confidence parameter

Calculating F0
• F0 is the zeroth frequency moment
• It represents the number of distinct elements in the data sequence
• The main application of F0 is in the query optimizers of large databases: it gives the number of distinct elements in a column without performing an expensive sort of the entire column
• The first algorithm to determine F0 was developed by Flajolet and Martin in their paper "Probabilistic counting algorithms for database applications" [5]
• Another major contribution was the K-Minimum Values (KMV) algorithm for estimating the number of distinct elements [4]

Calculating F0 (FM-Sketch Method)
• Inspired by a paper by Robert Morris, "Counting large numbers of events in small registers" [6]
• Assumes there exists an ideal hash function that distributes the elements of the sequence uniformly over the hash space
• The hash space is a bit string BITMAP[] of length L, initialized to all zeros
• The length L is of the order of log(n)

FM-Sketch Method (contd.)
• Let bit(y, k) denote the k-th bit in the binary representation of y
• ρ(y) denotes the position of the least significant 1-bit in the binary representation of y
• Example: for y with bits (positions 0 to 9, read left to right) 0 1 1 1 0 1 0 0 0 0, we have ρ(y) = 1 and bit(y, 4) = 0
• Let A be the data stream, of length m
• BITMAP[0 … L−1] represents the hash space

FM-Sketch Algorithm
for i := 0 to L−1 do BITMAP[i] := 0
for each x in A do
    index := ρ(hash(x))
    if BITMAP[index] = 0 then
        BITMAP[index] := 1
    end if
end for
R := position of the leftmost 0-bit of BITMAP[]
return 2^R

FM-Sketch Example
• Let the data stream consist of the elements a1, a2, a3, a4
• Let the hashed values be
  H(a1) = 011001
  H(a2) = 100101
  H(a3) = 101100
  H(a4) = 011011
• Reading positions left to right, ρ(H(a1)) = 1, ρ(H(a2)) = 0, ρ(H(a3)) = 0, ρ(H(a4)) = 1, so the algorithm produces BITMAP = 11000000
• The first occurrence of a 0-bit is at position 2, so F0 = 2^2 = 4

FM-Sketch (contd.)
• If there are n distinct elements in the data stream:
  • If i ≫ log(n), then BITMAP[i] is almost certainly 0
  • If i ≪ log(n), then BITMAP[i] is almost certainly 1
  • For i ≈ log(n), BITMAP[i] is a fringe of 0s and 1s
• This algorithm was tested on the online documentation of a UNIX system
  • It contained 26,692 lines in total
  • 16,405 of the lines were distinct
  • After hashing the lines, the following BITMAP was obtained: 111111111111001100000000
  • The leftmost 0 appeared at position 12 and the rightmost 1 at position 15; note that 2^14 = 16384, close to the true count
• To improve accuracy, the algorithm is extended by taking an array of bit strings instead of a single one and averaging the positions of the leftmost 0

Calculating F0 (KMV Algorithm)
• The problem with algorithms based on the FM-sketch is that they assume an ideal hash function that distributes the data uniformly over the hash space
• In practice it is difficult to obtain such a hash function
• Bar-Yossef et al. [4] introduced the K-Minimum Values (KMV) algorithm for determining the number of distinct elements in a data stream
• It uses a hash function h whose values are normalised to [0, 1], i.e. h : [n] → [0, 1]
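The FM-sketch pseudocode above can be sketched in Python. The SHA-1-based hash here is only an illustrative stand-in for the ideal hash function the analysis assumes, and ρ is taken over the least significant bits of the integer hash (rather than the slides' left-to-right reading of the bit string):

```python
import hashlib

L = 32  # sketch length, of the order of log2(n)

def rho(y: int) -> int:
    """Position of the least significant 1-bit of y (capped at L - 1 for y = 0)."""
    if y == 0:
        return L - 1
    return (y & -y).bit_length() - 1

def fm_estimate(stream) -> int:
    """FM-sketch estimate of the number of distinct elements, returned as 2**R."""
    bitmap = [0] * L
    for x in stream:
        # SHA-1 stands in for the ideal uniform hash function.
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:4], "big")
        bitmap[rho(h)] = 1
    # R = position of the leftmost (lowest-index) 0-bit of BITMAP.
    R = bitmap.index(0) if 0 in bitmap else L
    return 2 ** R
```

Flajolet and Martin [5] additionally divide 2^R by a correction factor φ ≈ 0.77351 to remove the bias of the raw estimate; the plain 2^R of the slides is kept here.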
Calculating F0 (KMV Algorithm)
• A limit t is fixed on the number of values kept from the hash space
• t is assumed to be of the order O(1/ε²)
• At any point the KMV set contains the t smallest hash values seen so far
• V = Max(KMV) is the largest of these t smallest hash values
• V is used to calculate F'0 with the formula
  F'0 = t / V

KMV Algorithm
initialize KMV with the first t hash values
for a in a1 … am do
    if h(a) < Max(KMV) then
        remove Max(KMV) from the KMV set
        insert h(a) into KMV
    end if
end for
V := Max(KMV)
return t / V

Example (KMV Algorithm)
[figure: eight hash values marked on the interval [0, 1]; the four smallest, all at most about 0.5, are highlighted in red]
• Let 8 distinct values of the stream be hashed as shown above
• Let t = 4; we keep only the 4 smallest hash values (highlighted in red)
• Then V = Max(4 smallest hash values) ≈ 0.5
• F'0 = t / V = 4 / 0.5 = 8

Complexity of the Algorithms
• Each hash value requires O(log m) memory bits
• The number of stored hash values t is of the order O(1/ε²)
• Therefore the KMV algorithm can be implemented in O((1/ε²) · log m) bits of space
• The access time can be reduced by storing the t hash values in a balanced binary tree
• The time per element is then reduced to O(log(1/ε) · log m)

Calculating Fk
• Alon et al. [1] estimate F_k by defining a random variable X that can be computed within the given space and time
• The expectation of this random variable, E(X), equals F_k
• X is constructed as follows:
  • Pick a random member a_p of the sequence A, at a uniformly random index p, and let l = a_p ∈ {1, 2, 3, …, n}
  • Let r = |{ q : q ≥ p, a_q = l }| be the number of occurrences of l among the members of A from position p onwards
  • Define the random variable X = m(r^k − (r − 1)^k)

Calculating Fk (contd.)
• Let S1 = 8k·n^(1−1/k)/λ², which is of the order n^(1−1/k)/λ², and S2 = 2·log(1/ε), which is of the order log(1/ε)
• The algorithm computes S2 random variables Y1, Y2, …, Y_S2 and outputs their median Y
• Each Yi is the average of the Xij for 1 ≤ j ≤ S1
• Next we verify that E(X) = F_k
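The KMV loop above can be sketched in Python with a max-heap holding the t smallest distinct hash values; the SHA-1-based h normalised to [0, 1] is again an illustrative stand-in for the ideal hash function:

```python
import hashlib
import heapq

def h(x) -> float:
    """Hash x to [0, 1]; SHA-1 stands in for the normalised hash h : [n] -> [0, 1]."""
    v = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")
    return v / 2**64

def kmv_estimate(stream, t: int) -> float:
    """Estimate F0 as t / V, where V is the t-th smallest hash value seen."""
    heap = []    # max-heap of the t smallest hash values, via negation
    seen = set() # hash values currently in the heap (KMV is a set of distinct values)
    for a in stream:
        ha = h(a)
        if ha in seen:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -ha)
            seen.add(ha)
        elif ha < -heap[0]:
            evicted = -heapq.heappushpop(heap, -ha)
            seen.discard(evicted)
            seen.add(ha)
    if len(heap) < t:
        return float(len(heap))  # fewer than t distinct values: the count is exact
    return t / -heap[0]          # F'0 = t / V with V = Max(KMV)
```

With t of the order 1/ε², the estimate is typically within a 1 ± ε factor of the true distinct count; for streams with fewer than t distinct values the sketch degenerates into an exact counter.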
Calculating Fk (contd.)

E(X) = Σ_{i=1}^{n} Σ_{j=1}^{m_i} (j^k − (j − 1)^k)
     = (1^k + (2^k − 1^k) + … + (m1^k − (m1 − 1)^k))
     + (1^k + (2^k − 1^k) + … + (m2^k − (m2 − 1)^k))
     + …
     + (1^k + (2^k − 1^k) + … + (mn^k − (mn − 1)^k))
     = Σ_{i=1}^{n} m_i^k = F_k

• Each inner sum telescopes to m_i^k, which gives the final equality

Complexity of Fk
• Each random variable X stores a_p and r
• So the space required for one X is of the order O(log n + log m)
• There are S1 × S2 random variables
• Hence the total space complexity of the algorithm is of the order
  O( (k·log(1/ε) / λ²) · n^(1−1/k) · (log n + log m) )

Calculating F2
• Using the previously discussed algorithm, we can compute F2 in O(√n · (log m + log n)) bits
• Alon et al. [1] simplified this algorithm using four-wise independent random variables
• The complexity is thereby reduced to
  O( (log(1/ε) / λ²) · (log n + log m) )

References
1. Alon, Noga, Yossi Matias, and Mario Szegedy. 'The Space Complexity of Approximating the Frequency Moments'. Journal of Computer and System Sciences 58.1 (1999): 137-147.
2. Woodruff, David. 'Frequency Moments'. (2005): 2-3.
3. Indyk, Piotr, and David Woodruff. 'Optimal Approximations of the Frequency Moments of Data Streams'. Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing - STOC '05 (2005): 202.
4. Bar-Yossef, Ziv, et al. 'Counting Distinct Elements in a Data Stream'. International Workshop on Randomization and Approximation Techniques 2483 (2002): 1-10.
5. Flajolet, Philippe, and G. Nigel Martin. 'Probabilistic Counting Algorithms for Database Applications'. Journal of Computer and System Sciences 31.2 (1985): 182-209.
6. Morris, Robert. 'Counting Large Numbers of Events in Small Registers'. Communications of the ACM 21.10 (1978): 840-842.
7. Flajolet, Philippe. 'Approximate Counting: A Detailed Analysis'. BIT 25.1 (1985): 113-134.

Thank you!