Bloom filters as a space-efficient dictionary structure
Szymon Grabowski
sgrabow@kis.p.lodz.pl
Instytut Informatyki Stosowanej, Politechnika Łódzka
April 2013

The idea of hashing
Hash table: store keys (and possibly satellite data); the location is found via a hash function; some collision resolution method is needed:
• chained hashing,
• open addressing.
Hash structures are typically fast…
…but have a few drawbacks: they are randomized, don't allow iteration in sorted order, and may require quite a lot of space.
Now, if we require as little space as possible, what can we do?

Bloom filter (Bloom, 1970)
Don't store the keys themselves! Just be able to answer whether a given key is in the structure.
If the answer is "no", it is correct. But if the answer is "yes", it may be wrong! So it's a probabilistic data structure.
There's a tradeoff between its space (average space per inserted key) and its "truthfulness".

Bloom filter features
• little space, so it saves on RAM (mostly older applications);
• little space, so the structure is also fast to transfer over a network;
• little space, sometimes small enough to fit in the L2 (or L1!) CPU cache (Li & Zhong, 2006: a BF makes a Bayesian spam filter work much faster thanks to fitting in the L2 cache);
• extremely simple / easy to implement;
• major application domains: databases, networking;
• …but also a few drawbacks / issues, hence significant interest in devising novel BF variants.

Bloom filter idea
The idea:
• keep a bit-vector of some size m, initially all zeros;
• use k independent hash functions (h.f.) for each added key (instead of one, as in a standard hash table);
• insert: write 1 in the k locations pointed to by the k h.f.;
• test for a key: if all k calculated locations hold a 1, return "yes" (= the key exists), which may be wrong; if among the k locations there is at least one 0, return "no", which is always correct.

BF, basic API
insert(x)
exists(x)
No delete(x)!
And of course: no iteration over the keys added to the BF (no content listing).
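The insert/exists pair above can be sketched in a few lines of Python (a minimal illustrative sketch, not an optimized implementation; deriving the k positions from two halves of one strong hash via h1 + i·h2 is a common double-hashing trick, not part of the original definition):

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: m bits, k hash functions, insert/exists only."""

    def __init__(self, m, k):
        self.m = m                            # number of bits
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # bit-vector, initially all zeros

    def _positions(self, key):
        # Simulate k independent hash functions with double hashing.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def exists(self, key):
        # "no" is always correct; "yes" may be a false positive.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(key))
```

Usage: after bf = BloomFilter(1024, 7) and bf.insert("cat"), bf.exists("cat") is True, while bf.exists on a key never inserted is False with high probability (and False with certainty on an empty filter).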
http://www.cl.cam.ac.uk/research/srg/opera/meetings/attachments/2008-10-14-BloomFiltersSurvey.pdf

Early applications – spellchecking (1982, 1990), hyphenation
If a spellchecker occasionally fails to flag a word missing from its dictionary – not a big problem. This is exactly the case with a BF in this application.
Quite a good application: the dictionary is static (or almost static), so once we set the BF size, we can estimate the error, which in practice does not change.
Application from Bloom's paper: a program for automatic hyphenation in which 90% of words can be hyphenated using simple rules, but 10% require a dictionary lookup.

Bloom speaking…
http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=7FB6933B782FBC9C98BBCDA0EB420935?doi=10.1.1.20.2080&rep=rep1&type=pdf

BF tradeoffs
The error grows with the load (i.e., with growing n/m, where n is the number of added items). When the BF is almost empty, the error is very small, but then we also waste lots of space.
Another factor: k. How to choose it? For any specified load (m set according to the 'expected' n in a given scenario) there is an optimal value of k (one that minimizes the error).
k too small – too many collisions; k too large – the bit-vector gets too 'dense' quickly (and too many collisions, too!).

Finding the best k
We assume the hash functions choose each bit-vector slot with equal probability.
Pr(a given bit NOT set by a given h.f.) = 1 − 1/m
Pr(a given bit NOT set) = (1 − 1/m)^(kn)
m and kn are typically large, so Pr(a given bit NOT set) ≈ e^(−kn/m)
Pr(a given bit is set) = 1 − (1 − 1/m)^(kn)
Consider an element not added to the BF: the filter will lie if all the corresponding k bits are set. This is:
Pr(a given bit is set)^k = (1 − (1 − 1/m)^(kn))^k ≈ (1 − e^(−kn/m))^k

Finding the best k, cont'd
Again, Pr(the Bloom filter lies) ≈ (1 − e^(−kn/m))^k.
Clearly, the error grows with growing n (for fixed k, m) and decreases with growing m (for fixed k, n).
What is the optimal k? Differentiation (= calculating a derivative) helps.
The error is minimized for k = ln 2 · m/n ≈ 0.693 m/n.
(Then the numbers of 1s and 0s in the bit-vector are approximately equal. Of course, k must be an integer!)
And the error (false positive rate, FPR) = (1/2)^k ≈ (0.6185)^(m/n).

Minimizing the error in practice
m = 8n → error ≈ 0.0214
m = 12n → error ≈ 0.0031
m = 16n → error ≈ 0.0005
http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html#SECTION00053000000000000000

FPR example, m/n = 8
www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt

Funny tricks with BF
Given two BFs representing sets S1 and S2, with the same number of bits and using the same hash functions, we can represent the union of those sets by taking the OR of the two bit-vectors of the original BFs.
Say you want to halve the memory use after some time, and assume the filter size is a power of 2. Just OR the first and second halves of the filter together. When hashing for a lookup, use the hash value modulo the new, halved size.
The intersection of two BFs (of the same size), i.e., the AND of their bit-vectors, can be used to approximate the intersection of two sets.

Scalable BF (Almeida et al., 2007)
We can find the optimal k knowing n/m in advance. As m is settled once, we must know (roughly) n, the number of items to add. What if we have only a pale idea of the size of n?
If the initial m is too large, we may halve it easily (see the previous slide). Crude, but possible.
What about m being too small? Solution: when the filter gets 'full' (reaches the limit on the fill ratio), a new one is added, with a tighter max FPR, and querying translates to testing at most all of those filters…

How to approximate a set without knowing its size in advance
ε – max allowed false positive rate
Classic result: the BF (and some other related structures) offers an O(n log(1/ε))-bit solution when n is known in advance.
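The optimal-k and FPR formulas above are easy to check numerically; the short sketch below (plain Python, illustrative only) reproduces the error table for m = 8n, 12n, 16n:

```python
import math

def optimal_k(m, n):
    """Optimal number of hash functions: k = ln 2 * m / n, rounded to an integer."""
    return max(1, round(math.log(2) * m / n))

def fpr(m, n, k):
    """Theoretical false positive rate: (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

# With the real-valued optimal k the FPR is (0.6185...)^(m/n);
# with an integer k it is very close to that.
for c in (8, 12, 16):               # c = m / n
    k = optimal_k(c, 1)
    print(f"m = {c}n  k = {k}  FPR ~ {fpr(c, 1, k):.4f}  (0.6185^{c} = {0.6185 ** c:.4f})")
```

Note the small gap between the table's values (which use the real-valued optimum k) and the integer-k figures, e.g. about 0.0214 vs. 0.0216 for m = 8n.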
Pagh, Segev, Wieder (2013):

Semi-join operation in a distributed database
database A:
Empl    Salary  Addr  City
John    60K     …     New York
George  30K     …     New York
Moe     25K     …     Topeka
Alice   70K     …     Chicago
Raul    30K     …     Chicago

database B:
City      Cost of living
New York  60K
Chicago   55K
Topeka    30K

Task: create a table (Empl, Salary, Addr, City, COL) of all employees who make < 40K and live in a city where the cost of living (COL) is > 50K.
Semi-join: send (from A to B) just (City). Anything better?
www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt

Bloom-join
• BF-based solution: A sends a Bloom filter instead of the actual city names,
• …then B sends back its answers…
• …from which A filters out the false positives.
This is to minimize the transfer (over a network) between the database sites! The CPU work is increased: B needs to filter its city list using the received filter, and A needs to filter the list of persons it receives back.

P2P keyword search (Reynolds & Vahdat, 2003)
• distributed inverted index on words, multi-word queries,
• Peer A holds the list of document IDs containing Word1, Peer B holds the list for Word2,
• their intersection is needed, but communication should be minimized,
• A sends B a Bloom filter of its document list,
• B sends back the possible intersection to A,
• A verifies it and sends the true result to the user,
• i.e., equivalent to a Bloom-join.

Distributed Web caches (Fan et al., 2000)
(figure: Web Cache 1, Web Cache 2, Web Cache 3)
www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt

k-mer counting in bioinformatics
k-mers: substrings of length k in a DNA sequence. Counting them is important: for de novo genome assemblers (based on a de Bruijn graph), for the detection of repeated sequences, to study the mechanisms of sequence duplication in genomes, etc.
http://www.homolog.us/blogs/wp-content/uploads/2011/07/i6.png

BFCounter algorithm (Melsted & Pritchard, 2011)
The considered problem variant: find all non-unique k-mers in a collection of reads, with their counts. I.e., ignore k-mers with an occurrence count of 1 (almost certainly noise).
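The Bloom-join exchange described earlier can be illustrated end to end with the employee/city tables from the semi-join example (a toy Python sketch; the tiny 64-bit filter and all helper names are illustrative, and a set of bit positions stands in for the bit-vector):

```python
import hashlib

M, K = 64, 3  # tiny filter, illustration only

def positions(key):
    # k positions for one key, via double hashing on SHA-256.
    d = hashlib.sha256(key.encode()).digest()
    h1, h2 = int.from_bytes(d[:8], "big"), int.from_bytes(d[8:16], "big")
    return {(h1 + i * h2) % M for i in range(K)}

# Site A: employees (name, salary, city); Site B: cost of living per city.
A = [("John", 60, "New York"), ("George", 30, "New York"),
     ("Moe", 25, "Topeka"), ("Alice", 70, "Chicago"), ("Raul", 30, "Chicago")]
B = {"New York": 60, "Chicago": 55, "Topeka": 30}

# Step 1: A builds a BF of the cities of its < 40K employees and sends it to B.
wanted = {city for _, sal, city in A if sal < 40}
filt = set().union(*(positions(c) for c in wanted))

# Step 2: B answers with every city that passes the filter and has COL > 50
# (this answer may contain false positives).
answer = {c: col for c, col in B.items() if col > 50 and positions(c) <= filt}

# Step 3: A joins locally; false-positive cities match no employee, so they drop out.
result = [(name, sal, city, answer[city])
          for name, sal, city in A if sal < 40 and city in answer]
print(result)
```

On this data the result is George (New York) and Raul (Chicago); only the small filter and the short answer cross the network, exactly the tradeoff described above.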
Input data: 2.66G 36bp Illumina reads (40-fold coverage).
(Output) statistics: 12.18G k-mers (k = 25) present in the sequencing reads, of which 9.35G are unique and 2.83G have coverage of two or greater.

BFCounter idea (Melsted & Pritchard, 2011)
Both a Bloom filter and a plain hash table are used. The Bloom filter B implicitly stores all k-mers seen so far, while only non-unique k-mers are inserted into the hash table T.
For each k-mer x, we check whether x is in B. If not, we update the appropriate bits in B to indicate that it has now been observed. If x is in B, then we check whether it is in T, and if not, we add it to T (with freq = 2).
What about false positives?

BFCounter idea, cont'd (Melsted & Pritchard, 2011)
After the first pass through the sequence data, one can re-iterate over the data to obtain exact k-mer counts in T (and then delete all unique k-mers).
Extra time for this second round: at most 50% of the total time, and it tends to be less, since hash table lookups are generally faster than insertions.
An approximate version is possible: no re-iteration (i.e., the coverage counts of some k-mers will be higher by 1 than their true values).

Memory usage for chr21 (Melsted & Pritchard, 2011)

BF, cache access
Negative answer: ½ chance that the first probed bit is 0, and then we terminate (i.e., 1 cache miss – in rare cases 0). On average, a negative answer costs (almost) 2 cache misses. Good (and hard to improve).
Positive answer: (almost) k misses on average. A problem not really addressed until quite recently…

Blocked Bloom filters (Putze et al., 2007, 2009)
The idea: the first h.f. determines the cache line (of typical size 64 B = 512 bits nowadays); the next k−1 h.f. are used to set or test bits (as usual), but only inside this one block. I.e., (up to) one cache miss, always!
Drawback: the FPR is slightly larger than with a plain BF for the same c := m/n and k. And the loss grows with growing c… (even if a smaller k is chosen for large c, which helps somewhat).
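The blocked addressing scheme above can be sketched as follows (illustrative Python assuming 512-bit blocks; this only shows how the first hash picks the block and the remaining k−1 hashes stay inside it, not Putze et al.'s tuned implementation):

```python
import hashlib

BLOCK_BITS = 512  # one cache line = 64 B

class BlockedBloomFilter:
    """First hash selects a 512-bit block; all other bits land in that block."""

    def __init__(self, num_blocks, k):
        self.num_blocks = num_blocks
        self.k = k
        self.blocks = [0] * num_blocks   # each int models one 512-bit cache line

    def _block_and_bits(self, key):
        h = int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")
        block = h % self.num_blocks      # first h.f.: which cache line
        h //= self.num_blocks
        bits = []
        for _ in range(self.k - 1):      # remaining k-1 h.f.: bits inside the block
            bits.append(h % BLOCK_BITS)
            h //= BLOCK_BITS
        return block, bits

    def insert(self, key):
        block, bits = self._block_and_bits(key)
        for b in bits:
            self.blocks[block] |= 1 << b

    def exists(self, key):
        block, bits = self._block_and_bits(key)
        return all(self.blocks[block] >> b & 1 for b in bits)
```

Since every probe for a key touches exactly one block, a real implementation incurs at most one cache miss per query, at the price of the slightly higher FPR discussed above.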
Blocked Bloom filters, cont'd (Putze et al., 2007, 2009)
I.e., if c < 20 (the top row), then the space usually grows by < 20% compared to the plain BF, at a comparable FPR. Unfortunately, for large c (rarely needed?) the loss is very significant.
The idea of blocking for a BF was first suggested in (Manber & Wu, 1994), for storing the filter on disk.

Counting Bloom filter (Fan et al., 1998, 2000)
A BF with delete: use small counters instead of single bits. BF[pos]++ at insert, BF[pos]-- at delete. E.g., 4 bits: counts up to 15.
Problem: counter overflow (plain solution: freeze the given counter).
Another (obvious) problem: more space, e.g., 4 times as much.
With 4-bit counters and k < ln 2 · (m/n), the probability of overflow is ≤ 1.37e−15 · m.

CBF, another problem…
A deletion instruction for a false positive item (a.k.a. an incorrect deletion of a false positive item) may produce false negatives!
The problem is widely discussed and analyzed in (Guo et al., 2010).

Deletable Bloom filter (DlBF) (Rothenberg et al., 2010)
Cute observation: those of the k bits of an item x which have no collision may be safely unset. If at least one of those k bits is such, then we've managed to delete x!
How to distinguish colliding (overlapping) set bits from non-colliding ones? One extra bit per location? Quite costly…

Deletable Bloom filter, cont'd (Rothenberg et al., 2010)
Compromise solution: divide the bit-vector into small regions; a region is marked as collision-free if and only if no collision happened inside it (one marker bit per region).

DlBF, deletability probability as a function of filter density (figure)

Compressed Bloom filter (Mitzenmacher, 2002)
If RAM is not an issue, but we want to transmit the filter over a network…
Mitzenmacher noticed that it pays to use more space, incl. more 0 bits (i.e., the structure is sparser), as then the bit-vector becomes compressible. (In a plain BF the numbers of 0s and 1s are approximately equal, so it is practically incompressible.)
With m/n increased from 16 to 48: after compression, approximately
the same size, but the FPR drops by a factor of two.

Conclusions
The Bloom filter is alive and kicking! Lots of applications and lots of new variants.
In theory: a constant FP rate and a constant number of bits per key. In practice: always think about what FP rate you can allow. Also: what do the errors mean (erroneous results, or "only" increased processing time due to false positives?).
Bottom line: a succinct data structure for Big Data.