Uploaded by jothishprabu8

Machine Translated by Google
Thuchanh bai5 2 Flayolet Martin
Computer Architecture and Operating Systems (National Economics University)
Downloaded by Rick Dalton (samurainightlego@gmail.com)
MINING OF MASSIVE DATASETS: PRACTICE LESSON 5 – PART 1.
FLAJOLET-MARTIN: APPROXIMATE THE NUMBER OF DISTINCT ELEMENTS
In this exercise, we use the Flajolet-Martin algorithm to estimate (approximate) the number of distinct elements in a sequence of input data.
Illustrative data:
The test data set in the illustrative program is self-generated. Students develop the programs required in the self-practice section to apply the algorithm to real data.
Review:
- Original algorithm:
o Use a hash function to hash the input data into binary strings.
o For each element Aj, take the length of the run of consecutive 0 bits at the right end => denoted r(Aj).
o Take R = max(r(Aj)) over the elements seen so far.
o The number of distinct elements so far is approximated by: count ≈ 2^R.
- Modified algorithm 1:
o Use m different hash functions hj; j = 1, 2, …, m.
o Hash the data as above, once with each of the hash functions hj.
o Let Rj be the maximum value obtained with hj; take the average R = (Σ Rj)/m.
o The number of distinct elements so far is approximated by: count ≈ 2^R. Each hj sees the whole stream, so every Rj already estimates log2 of the total count and no factor m is needed here.
- LogLog algorithm
o Use a hash function to hash the input data into binary strings.
o Group the data by some attribute; the original LogLog algorithm uses the first few (leftmost) bits of the hash result to divide the results into m buckets.
o Find Ri = max(r(Aj)), taken separately over the elements Aj of each group Gi.
o Find R = median of the series Ri.
o The number of distinct elements so far is approximated by: count ≈ m*2^R.
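Modified algorithm 1 can be sketched as follows. This is a minimal illustration, not the lesson's reference solution: the function names and the four (a, b) coefficient pairs are assumptions, chosen in the style of h(A) = (a*A + b) mod 32, and a hash value of 0 is given r = 0 by convention. Because every hash function sees the whole stream, each Rj already estimates log2 of the total count, so the sketch returns 2^R for the averaged R without multiplying by m.

```python
import statistics

def trailing_zeros(x):
    """r(A): number of consecutive 0 bits at the right end of x.

    A value of 0 is assigned r = 0 (an assumed convention for this sketch).
    """
    if x == 0:
        return 0
    # x & -x isolates the lowest set bit; its position is the zero count.
    return (x & -x).bit_length() - 1

def make_hash(a, b, modulus=32):
    """Build a hash function h(A) = (a*A + b) mod modulus."""
    return lambda value: (a * value + b) % modulus

def fm_multi_hash(stream, hash_functions):
    """Return (list of Rj, estimate), where estimate = 2 ** mean(Rj)."""
    maxima = [0] * len(hash_functions)
    for element in stream:
        for j, h in enumerate(hash_functions):
            maxima[j] = max(maxima[j], trailing_zeros(h(element)))
    return maxima, 2 ** statistics.mean(maxima)

stream = [1, 2, 3, 4, 5, 6, 4, 2, 5, 9, 1, 6, 3, 7, 1, 2, 2, 4, 2, 1, 7, 6, 5, 2, 1]
# Hypothetical coefficients, for illustration only.
hash_functions = [make_hash(1, 6), make_hash(3, 1), make_hash(5, 3), make_hash(7, 2)]

maxima, estimate = fm_multi_hash(stream, hash_functions)
print('R per hash function:', maxima)
print('estimated distinct elements:', round(estimate, 2))
```

For this stream the four maxima are 3, 4, 4 and 4, so R = 3.75 and the estimate is about 13.45 against an exact count of 8. On such a small input the estimate is only a rough indication.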
1. Code illustrating the original algorithm in Python:
In the example below, we create a short data sequence (an array of numbers) for illustration. We use a single self-defined hash function, h(A) = (1*A + 6) mod 32, which you can change.
The program also counts the elements in the conventional way for comparison.
import time

stream = [1, 2, 3, 4, 5, 6, 4, 2, 5, 9, 1, 6, 3, 7, 1, 2, 2, 4, 2, 1, 7, 6, 5, 2, 1]

# Check in the usual way: collect the unique values and count them
print('Using conventional Algorithm:')
start_time = time.time()
st_unique = []
for i in stream:
    if i in st_unique:
        continue
    else:
        st_unique.append(i)
print('distinct elements', len(st_unique))
print("--- %s seconds ---" % (time.time() - start_time))

# Flajolet-Martin estimate
print('Using Flajolet Martin Algorithm:')
start_time = time.time()
maxnum = 0
for i in range(0, len(stream)):
    # Hash the element and write the result as a binary string (without '0b')
    val = bin((1*stream[i] + 6) % 32)[2:]
    # Count the consecutive '0' bits at the right end of the string
    zeros = 0
    for j in range(len(val)-1, 0, -1):
        if val[j] == '0':
            zeros += 1
        else:
            break
    # Keep the largest count seen so far
    if zeros > maxnum:
        maxnum = zeros
print('distinct elements', 2**maxnum)
print("--- %s seconds ---" % (time.time() - start_time))
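To see where the estimate comes from, the short sketch below lists, for each distinct element of the stream, its hash value in binary and r(A). The helper names h and r are ours, for illustration only, and a hash value of 0 is assigned r = 0, which matches what the loop in the program above produces.

```python
stream = [1, 2, 3, 4, 5, 6, 4, 2, 5, 9, 1, 6, 3, 7, 1, 2, 2, 4, 2, 1, 7, 6, 5, 2, 1]

def h(a):
    # Same hash as in the program above: h(A) = (1*A + 6) mod 32
    return (1 * a + 6) % 32

def r(value):
    # Number of consecutive 0 bits at the right end of value (0 when value == 0)
    bits = bin(value)[2:]
    return len(bits) - len(bits.rstrip('0')) if value else 0

# Map each distinct element to (5-bit hash string, trailing-zero count)
trace = {a: (format(h(a), '05b'), r(h(a))) for a in sorted(set(stream))}
for a, (bits, zeros) in trace.items():
    print(f"A={a}  h(A)={bits}  r(A)={zeros}")

R = max(zeros for _, zeros in trace.values())
print("R =", R, "-> estimate", 2 ** R)
```

Only the element 2 hashes to 01000, with three trailing zeros, so R = 3 and the estimate is 2^3 = 8, which happens to equal the exact count for this stream.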
Practical exercise:
1) Implement the original algorithm with a different data sequence and different hash functions.
2) Build about 4 hash functions following the sample above, then run the modified algorithm and compare the results.
LOGLOG CONDITIONING ALGORITHM
In the second example, we use the text files quotes_2008-12.txt and quotes_2009-01.txt as data (provided by the teacher or obtained from http://snap.stanford.edu/data/bigdata/memetracker9/quotes_2008-12.txt.gz and http://snap.stanford.edu/data/bigdata/memetracker9/quotes_2009-01.txt.gz).
Because the data is in text format, we use the mmh3 hash library throughout. We treat the words in each data file as belonging to one group. The source code of the Flajolet-Martin LogLog algorithm is as follows:
# -*- coding: utf-8 -*-
"""
Created on Wed May 4 06:42:35 2022
@author: WinIF Chung
"""
import mmh3
import statistics
import math

def trailing_zeros(n):
    # Number of consecutive '0' characters at the right end of the bit string
    s = str(n)
    return len(s) - len(s.rstrip('0'))

input_file = ['quotes_2008-12.txt', 'quotes_2009-01.txt']
result = ["" for i in range(2)]
result_tail = [[] for i in range(2)]
for i in input_file:
    with open(i, "r", encoding='ISO-8859-1') as fp:
        for line in fp:
            stream = line.split("\t")
            # Only lines whose first field is 'Q' are hashed
            if stream[0] == 'Q':
                # Hash the text with two different seeds (two hash functions)
                for seed in range(2):
                    result[seed] = format(abs(mmh3.hash(stream[1], seed)), '032b')
                    result_tail[seed].append(trailing_zeros(result[seed]))
# One estimate 2^R per seed, then the median of the two estimates
group1 = 2**max(result_tail[0])
group2 = 2**max(result_tail[1])
print(math.ceil(statistics.median([group1, group2])))
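The program above combines two hash seeds rather than bucketing by hash prefix. For comparison, the bucketed grouping described in the review section might be sketched as below. Everything here is illustrative: hash32 is a stand-in built on hashlib.sha256 so that the example runs without mmh3 or the data files, and the function names, the bucket_bits parameter and the synthetic input are assumptions. The sketch follows the lesson's formula count ≈ m*2^R with R the median of the per-bucket maxima, and it omits the bias correction used in the published LogLog estimator, so expect only an order-of-magnitude figure.

```python
import hashlib
import statistics

def hash32(text, seed=0):
    """Stand-in 32-bit hash (assumption: replaces mmh3.hash for a dependency-free demo)."""
    digest = hashlib.sha256(f"{seed}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def trailing_zeros(x, width):
    """Trailing 0 bits of x; an all-zero value counts as `width` zeros."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def loglog_estimate(items, bucket_bits=4):
    """Estimate the distinct count as m * 2 ** median(Ri) over m = 2**bucket_bits buckets."""
    m = 1 << bucket_bits
    tail_bits = 32 - bucket_bits
    maxima = [0] * m
    for item in items:
        value = hash32(item)
        bucket = value >> tail_bits              # leftmost bits choose the bucket
        tail = value & ((1 << tail_bits) - 1)    # remaining bits give r(A)
        maxima[bucket] = max(maxima[bucket], trailing_zeros(tail, tail_bits))
    return m * 2 ** statistics.median(maxima)

# Synthetic input standing in for the quote phrases.
words = [f"quote-{i}" for i in range(5000)]
print("exact:", len(set(words)))
print("LogLog-style estimate:", round(loglog_estimate(words)))
```

Only the m per-bucket maxima are kept in memory, and repeating an item never changes them, so duplicates in the input do not affect the estimate.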