Thuchanh bai5 2 Flayolet Martin
Computer Architecture and Operating Systems (National Economics University)

MINING OF MASSIVE DATASETS: PRACTICE LESSON 5 - PART 1.
FLAJOLET-MARTIN: APPROXIMATE THE NUMBER OF DISTINCT ELEMENTS

In this exercise, we use the Flajolet-Martin algorithm to estimate (approximate) the number of distinct elements in an input data stream.

Illustrative data: The test data in the illustrated program is self-generated. Students develop the programs required in the self-practice section and apply them to real data.

Review:
- Original algorithm:
  o Use a hash function to hash the input data into binary strings.
  o For each element Aj, take the number of consecutive 0 bits at the right end of its hash (trailing zeros), denoted r(Aj).
  o Take R = max(r(Aj)) over all elements seen so far.
  o The number of distinct elements so far is approximated by: count ≈ 2^R.
- Modified algorithm 1:
  o Use m different hash functions hj; j = 1, 2, …, m.
  o Hash the data as above, once with each of the hash functions hj.
  o Let Rj be the maximum value obtained with hj; take the average R = (Σ Rj)/m.
  o The number of distinct elements so far is approximated by: count ≈ 2^R. Each hash function sees the whole stream, so every 2^Rj already estimates the full count and no factor m is applied here.
- LogLog algorithm:
  o Use a hash function to hash the input data into binary strings.
  o Group the data by some attribute. The original LogLog algorithm uses the first few (leftmost) bits of the hash result to split the results into m buckets.
  o Find Ri = max(r(Aj)), taken only over the elements Aj in group Gi.
  o Find R = median of the series Ri.
  o The number of distinct elements so far is approximated by: count ≈ m*2^R.

1. Code illustrating the original algorithm in Python:
In the example below, we create a short data sequence (an array of numbers) for illustration.
We use only one self-defined hash function: h(A) = (1*A + 6) mod 32 (you can change it). The program also counts the distinct elements in the usual way for comparison.

    import time

    stream = [1,2,3,4,5,6,4,2,5,9,1,6,3,7,1,2,2,4,2,1,7,6,5,2,1]

    # Count the distinct elements in the usual way
    print('Using conventional Algorithm:')
    start_time = time.time()
    st_unique = []
    for i in stream:
        if i in st_unique:
            continue
        else:
            st_unique.append(i)
    print('distinct elements', len(st_unique))
    print("--- %s seconds ---" % (time.time() - start_time))

    # Estimate with the Flajolet-Martin algorithm
    print('Using Flajolet Martin Algorithm:')
    start_time = time.time()
    maxnum = 0
    for i in range(0, len(stream)):
        # Hash the element and keep the binary digits without the '0b' prefix
        val = bin((1*stream[i] + 6) % 32)[2:]
        # Count the trailing zeros, scanning from the right (index 0 is not examined)
        zeros = 0
        for j in range(len(val)-1, 0, -1):
            if val[j] == '0':
                zeros += 1
            else:
                break
        if zeros > maxnum:
            maxnum = zeros
    print('distinct elements', 2**maxnum)
    print("--- %s seconds ---" % (time.time() - start_time))

Practical exercise:
1) Implement the original algorithm with a different data sequence and different hash functions.
2) Build about 4 hash functions following the sample above, then run the modified algorithm and compare the results.

LOGLOG CONDITIONING ALGORITHM

In the second example, we use the text files quotes_2008-12.txt and quotes_2009-01.txt as data (provided by the teacher or obtained at the links http://snap.stanford.edu/data/bigdata/memetracker9/quotes_2008-12.txt.gz and http://snap.stanford.edu/data/bigdata/memetracker9/quotes_2009-01.txt.gz). Because the data is text, we use the mmh3 hash function library throughout. We consider the words in each data file to belong to one group.
The source code of the Flajolet-Martin LogLog algorithm is as follows:

    # -*- coding: utf-8 -*-
    """
    Created on Wed May 4 06:42:35 2022
    @author: WinIF Chung
    """
    import mmh3
    import statistics
    import math

    def trailing_zeros(n):
        # n is a string of binary digits; count the '0' characters at its right end
        s = str(n)
        return len(s) - len(s.rstrip('0'))

    input_file = ['quotes_2008-12.txt', 'quotes_2009-01.txt']
    result = ["" for i in range(2)]
    result_tail = [[] for i in range(2)]
    for i in input_file:
        fp = open(i, "r", encoding='ISO-8859-1')
        for line in fp:
            stream = line.split("\t")
            # Only lines whose first field is 'Q' are hashed.
            # Compare with ==, because 'is' tests object identity, not equality.
            if stream[0] == 'Q':
                for seed in range(2):
                    # Hash the second field with this seed and format it as 32 binary digits
                    result[seed] = format(abs(mmh3.hash(stream[1], seed)), '032b')
                    result_tail[seed].append(trailing_zeros(result[seed]))
        fp.close()
    # One estimate per seed, then the median of the two estimates
    group1 = 2**max(result_tail[0])
    group2 = 2**max(result_tail[1])
    print(math.ceil(statistics.median([group1, group2])))
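
The listing above hashes every quote with two seeds instead of splitting the hash values into buckets. For comparison, here is a minimal sketch of the bucketed variant described in the review: the first k bits of the hash choose one of m = 2^k buckets, the remaining bits supply the trailing-zero count, and R is the median of the per-bucket maxima. Several things here are assumptions rather than part of the lesson: the names hash32, trailing_zeros and loglog_estimate, the 32-bit width, the default k = 4, and the use of a truncated MD5 digest from the standard library as a stand-in for mmh3.

```python
import hashlib
import statistics

HASH_BITS = 32  # assumed hash width


def hash32(item):
    """Stand-in 32-bit hash (assumption: first 4 bytes of MD5, not mmh3)."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def trailing_zeros(value, width):
    """Number of trailing 0 bits in value; returns width when value is 0."""
    if value == 0:
        return width
    # value & -value isolates the lowest set bit
    return (value & -value).bit_length() - 1


def loglog_estimate(items, k=4):
    """Estimate the distinct count with m = 2**k buckets (hypothetical helper)."""
    m = 2 ** k
    rest_bits = HASH_BITS - k
    max_r = [0] * m                        # Ri for each bucket Gi
    for item in items:
        h = hash32(item)
        bucket = h >> rest_bits            # first (leftmost) k bits pick the bucket
        rest = h & ((1 << rest_bits) - 1)  # remaining bits give r(Aj)
        r = trailing_zeros(rest, rest_bits)
        if r > max_r[bucket]:
            max_r[bucket] = r
    R = statistics.median(max_r)           # median of the Ri, as in the review
    return m * 2 ** R                      # count is approximately m * 2^R


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 4, 2, 5, 9, 1, 6, 3, 7, 1, 2, 2, 4, 2, 1, 7, 6, 5, 2, 1]
    print("estimated distinct elements:", loglog_estimate(data, k=2))
```

No bias correction is applied in this sketch, and empty buckets keep Ri = 0, so the result should be read as a rough order of magnitude, especially on short streams. Repeating elements never changes the estimate, because each bucket only stores a maximum.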