Calculating Frequency Moments of a Data Stream
Asad Narayanan
COMP 5703
Outline
• What is a data stream?
• Constraints of data streams
• Applications of data streams
• Frequency moments
• Calculating frequency moments
• Calculating F0 using FM-Sketch
• Calculating F0 using KMV
• Complexity of calculating F0
• Calculating Fk
• Complexity of calculating Fk
Data Stream
• A sequence of voluminous data arriving at high speed
• Elements can only be accessed one at a time
• Elements arrive in arbitrary order
• The stream cannot be stored and processed later
• Example: network analysis, where around 1 million packets arrive per second
• Our aim is to compute statistics on this data
Formal Definition
[Figure: a stream a1, a2, a3, …, at, …, am arriving over time T]
• A sequence of data A = (a1, a2, a3, …, am) of length m
• ai ∈ {1, 2, 3, …, n}, where n is the number of distinct values
• mi = |{ j | aj = ai , 1 ≤ j ≤ m }| is the number of occurrences of ai
• m and n are very large
• It is impossible to store A on local disk
Applications
• Sensor networks
• Network monitoring systems
• Data stream mining
• Detecting credit card fraud
• Database systems
Limitations
• Recording all the data is impossible
• If we tried to record all IP addresses passing through a network, we would need space of the order of 2^32
• The data needs to be processed in one pass
• We must work within limited space and time
• Instead, we obtain a sketch of the data which can be reused to compute statistics
Frequency Moments
• A powerful statistical tool which can be used to determine demographic information about the data
• The k-th frequency moment of sequence A, for k ≥ 0, is defined as:

    Fk(A) = Σ_{i=1}^{n} mi^k

• F0 is the number of distinct elements in A
• F1 is the total number of elements in A
• Fk for k ≥ 2 gives an idea of how skewed the data distribution is
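Before turning to sketches, it helps to see the baseline: with unbounded memory, Fk can be computed exactly by counting each mi. A minimal Python sketch of this direct approach:

```python
from collections import Counter

def frequency_moment(stream, k):
    """Compute F_k exactly by counting the occurrences m_i of each
    distinct value, then summing m_i ** k."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

stream = [1, 2, 1, 3, 2, 1]
print(frequency_moment(stream, 0))  # distinct elements -> 3
print(frequency_moment(stream, 1))  # stream length -> 6
print(frequency_moment(stream, 2))  # 3^2 + 2^2 + 1^2 -> 14
```

The Counter is exactly the Ω(n)-space table of mi values that the streaming algorithms below are designed to avoid.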
Calculating Frequency Moments
• The direct approach requires memory of the order Ω(n) to store mi for every distinct element ai ∈ {1, 2, 3, …, n}
• But we have memory limitations, so we need an algorithm that computes Fk in much less memory
• This can be achieved if we are ready to compromise on accuracy
• An algorithm computes an (ε, δ)-approximation of Fk if

    Pr[ |F'k − Fk| ≤ ε·Fk ] ≥ 1 − δ

• F'k is the (ε, δ)-approximated value of Fk
• ε is the approximation parameter and δ is the confidence parameter
Calculating F0
• F0 is the zeroth frequency moment
• It represents the number of distinct elements in the data sequence
• The main application of F0 is in the query optimizers of large databases:
  • to obtain the number of distinct elements in a column without performing expensive sorting operations on the entire column
• The first algorithm to estimate F0 was developed by Flajolet and Martin in their paper "Probabilistic counting algorithms for database applications"
• Another major contribution was the K-Minimum Values (KMV) algorithm for estimating the number of distinct elements
Calculating F0 (FM-Sketch method)
• Inspired by a paper by Robert Morris, "Counting large numbers of events in small registers"
• Assumes there exists an ideal hash function that uniformly distributes the elements of the sequence over the hash space
• The hash space is a bit string BITMAP[] of length L, initialized to all zeros
• The length L is assumed to be of the order of log(n)
FM-Sketch Method (contd.)
• Let bit(y, k) be the k-th bit in the binary representation of y
• Let ρ(y) be the position of the least significant 1-bit in the binary representation of y
• Example (bit positions 0–9):

    position: 0 1 2 3 4 5 6 7 8 9
    bit:      0 1 1 1 0 1 0 0 0 0

  Here ρ(y) = 1 and bit(y, 4) = 0
• Let A be the data stream of length m
• BITMAP[0 … L−1] is the hash space
FM-Sketch Algorithm
for i := 0 to L−1 do BITMAP[i] := 0
for each x in A do
    index := ρ(hash(x))
    if BITMAP[index] = 0 then
        BITMAP[index] := 1
    end if
end for
B := position of the leftmost 0-bit of BITMAP[]
return 2^B
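The pseudocode above can be sketched directly in Python. The analysis assumes an ideal hash function; here hashlib.sha1 stands in for it, which is an assumption, not part of the original algorithm:

```python
import hashlib

def rho(y):
    """Position of the least significant 1-bit of y (rho(0) taken as large)."""
    if y == 0:
        return 64
    pos = 0
    while y & 1 == 0:
        y >>= 1
        pos += 1
    return pos

def fm_estimate(stream, L=32):
    """FM-Sketch estimate of the number of distinct elements.
    sha1 is used as a stand-in for the ideal hash the analysis assumes."""
    bitmap = [0] * L
    for x in stream:
        h = int(hashlib.sha1(str(x).encode()).hexdigest(), 16)
        index = rho(h)
        if index < L:
            bitmap[index] = 1
    # B = position of the leftmost 0-bit of the bitmap
    B = next(i for i, b in enumerate(bitmap) if b == 0)
    return 2 ** B
```

Because the estimate depends only on which bit positions get set, repeating an element never changes the answer, which is exactly why the sketch counts distinct elements.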
FM-Sketch Example
• Let the data stream be a1, a2, a3, a4
• Let the hashed values (reading the leftmost bit as position 0) be:
    h(a1) = 011001    ρ = 1
    h(a2) = 100101    ρ = 0
    h(a3) = 101100    ρ = 0
    h(a4) = 011011    ρ = 1
• The algorithm sets positions 0 and 1, so BITMAP = 11000000
• The first 0-bit occurs at position B = 2
• F0 = 2^B = 2^2 = 4
FM-Sketch (contd.)
• If there are n distinct elements in the data stream:
  • if i ≫ log(n), then BITMAP[i] is almost certainly 0
  • if i ≪ log(n), then BITMAP[i] is almost certainly 1
  • for i ≈ log(n), BITMAP[i] is a fringe of mixed 0s and 1s
• Flajolet and Martin tested the algorithm on the online documentation of a UNIX system:
  • it contained 26,692 lines in total, of which 16,405 were distinct
  • after hashing the lines, the following BITMAP was obtained:
    BITMAP = 111111111111001100000000
  • the leftmost 0 appears at position 12 and the rightmost 1 at position 15
  • note that 2^14 = 16,384, close to the true count of 16,405
• To improve accuracy, the algorithm is extended by taking an array of bit strings instead of a single one and averaging the positions of the first 0
Calculating F0 (KMV Algorithm)
• The problem with FM-Sketch-based algorithms is that they assume an ideal hash function that uniformly distributes data over the hash space
• In practice it is difficult to obtain such a hash function
• Bar-Yossef et al. [4] introduced the K-Minimum Values (KMV) algorithm for estimating the number of distinct elements in a data stream
• It uses a hash function h normalised to the unit interval, h : [n] → [0, 1]
Calculating F0 (KMV Algorithm)
• A limit t is fixed on the number of hash values kept
• t is assumed to be of the order O(1/ε²)
• At any point the sketch contains the t smallest hash values seen so far
• Let V = Max(h(ai)) be the largest of these t hash values
• V is used to estimate F0 with the formula:

    F'0 = t / V
KMV Algorithm
initialize KMV with the first t distinct hash values
for each remaining a in A do
    if h(a) < Max(KMV) and h(a) ∉ KMV then
        remove Max(KMV) from the KMV set
        insert h(a) into KMV
    end if
end for
V := Max(KMV)
return t / V
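A minimal Python sketch of the KMV estimator above. As with FM-Sketch, sha1 scaled to [0, 1) stands in for the normalised hash function h, and the parameter default is an arbitrary illustrative choice:

```python
import hashlib

def kmv_estimate(stream, t=64):
    """KMV estimate of the number of distinct elements: keep the t
    smallest normalised hash values and return t / Max(KMV)."""
    def h(x):
        digest = int(hashlib.sha1(str(x).encode()).hexdigest(), 16)
        return digest / 2 ** 160  # normalise sha1 output to [0, 1)

    kmv = set()
    for a in stream:
        v = h(a)
        if len(kmv) < t:
            kmv.add(v)                  # still filling the sketch
        elif v < max(kmv) and v not in kmv:
            kmv.remove(max(kmv))        # evict the current maximum
            kmv.add(v)
    if len(kmv) < t:
        return len(kmv)  # fewer than t distinct values: count is exact
    return t / max(kmv)
```

Note that duplicates hash to the same value and are absorbed by the set, so only distinct elements influence the estimate.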
KMV Example
• [Figure: 8 distinct hash values plotted on the interval [0, 1]]
• Let 8 distinct values of the stream be hashed onto [0, 1] as shown
• Let t = 4; we keep only the 4 smallest hash values
• Then V = Max(4 smallest hash values) ≈ 0.5
• F'0 = t / V = 4 / 0.5 = 8
Complexity of the Algorithms
• Each hash value requires O(log m) bits of memory
• The number of stored hash values t is of the order O(1/ε²)
• Therefore the KMV algorithm can be implemented in O((1/ε²) · log m) bits of space
• The access time can be reduced by storing the t hash values in a balanced binary tree
• The time per update is then reduced to O(log(1/ε) · log m)
Calculating Fk
• Alon et al. estimate Fk by defining a random variable X that can be computed within the given space and time
• The estimate of Fk is the expectation E(X) of this random variable
• X is constructed as follows:
  • pick a random position p in the sequence A and let ap = l ∈ {1, 2, 3, …, n}
  • let r = |{ q : q ≥ p, aq = l }| be the number of occurrences of l among the members of A from position p onwards
  • define the random variable X = m(r^k − (r − 1)^k)
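The construction of X can be sketched in Python. For clarity the stream is held in a list so a random position can be drawn directly; a true one-pass implementation would pick p by reservoir sampling, and the function names are illustrative:

```python
import random

def ams_single_estimate(stream, k):
    """One sample of the Alon-Matias-Szegedy random variable
    X = m * (r^k - (r-1)^k), whose expectation is F_k."""
    m = len(stream)
    p = random.randrange(m)        # random position in the stream
    l = stream[p]                  # the element found there
    r = sum(1 for q in range(p, m) if stream[q] == l)
    return m * (r ** k - (r - 1) ** k)

def ams_estimate(stream, k, trials=1000):
    """Average many independent samples; the mean converges to F_k."""
    return sum(ams_single_estimate(stream, k) for _ in range(trials)) / trials
```

For k = 1 the telescoping factor collapses to 1, so every sample equals m exactly, which matches F1 = m.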
Calculating Fk (contd.)
• Let S1 = 8k·n^(1−1/k) / λ², which is of the order O(n^(1−1/k) / λ²), and S2 = 2·log(1/ε), which is of the order O(log(1/ε))
• The algorithm takes S2 random variables Y1, Y2, …, Y_S2 and outputs their median Y
• Each Yi is the average of S1 independent copies Xij, 1 ≤ j ≤ S1
• Next we show that E(X) = Fk
Calculating Fk (contd.)

E(X) = Σ_{i=1}^{n} Σ_{j=1}^{mi} ( j^k − (j−1)^k )
     =  (1^k + (2^k − 1^k) + … + (m1^k − (m1 − 1)^k))
      + (1^k + (2^k − 1^k) + … + (m2^k − (m2 − 1)^k))
      + …
      + (1^k + (2^k − 1^k) + … + (mn^k − (mn − 1)^k))
     = Σ_{i=1}^{n} mi^k
     = Fk
Complexity of Fk
• Each random variable X stores ap and r
• So the space required for one X is of the order O(log n + log m)
• There are S1 × S2 random variables in total
• Hence the total space complexity of the algorithm is of the order

    O( (k·log(1/ε) / λ²) · n^(1−1/k) · (log n + log m) )
Calculating F2
• Using the previously discussed algorithm, we can compute F2 in O(√n · (log m + log n)) bits
• Alon et al. simplified this algorithm for k = 2 using four-wise independent random variables
• The complexity is thereby reduced to

    O( (log(1/ε) / λ²) · (log n + log m) )
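The F2 improvement can be sketched as follows. Each sketch draws a four-wise independent ±1 hash s (here built from a random cubic polynomial modulo a prime, a standard construction; the specific prime and the number of copies are illustrative choices), maintains the running sum Z = Σ s(ai), and uses Z² as the estimate, since E[Z²] = F2:

```python
import random

def make_four_wise_sign(prime=2_147_483_647):
    """Four-wise independent +/-1 hash from a random degree-3
    polynomial modulo a prime."""
    a, b, c, d = (random.randrange(prime) for _ in range(4))
    def s(x):
        v = (((a * x + b) * x + c) * x + d) % prime
        return 1 if v % 2 == 0 else -1
    return s

def ams_f2_estimate(stream, copies=50):
    """Average of several independent sketches Z = sum_i s(a_i);
    E[Z^2] = F2, so the mean of Z^2 estimates F2."""
    estimates = []
    for _ in range(copies):
        s = make_four_wise_sign()
        z = sum(s(x) for x in stream)
        estimates.append(z * z)
    return sum(estimates) / len(estimates)
```

Only the single counter Z is kept per sketch, which is where the √n factor in the space bound disappears.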
References
1. Alon, Noga, Yossi Matias, and Mario Szegedy. 'The Space Complexity of Approximating the Frequency Moments'. Journal of Computer and System Sciences 58.1 (1999): 137–147.
2. Woodruff, David. 'Frequency Moments'. (2005): 2–3.
3. Indyk, Piotr, and David Woodruff. 'Optimal Approximations of the Frequency Moments of Data Streams'. Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing (STOC '05) (2005): 202.
4. Bar-Yossef, Ziv, et al. 'Counting Distinct Elements in a Data Stream'. International Workshop on Randomization and Approximation Techniques 2483 (2002): 1–10.
5. Flajolet, Philippe, and G. Nigel Martin. 'Probabilistic Counting Algorithms for Database Applications'. Journal of Computer and System Sciences 31.2 (1985): 182–209.
6. Morris, Robert. 'Counting Large Numbers of Events in Small Registers'. Communications of the ACM 21.10 (1978): 840–842.
7. Flajolet, Philippe. 'Approximate Counting: A Detailed Analysis'. BIT 25.1 (1985): 113–134.
Thank you!