Bloom filters
as a space-efficient dictionary structure
Szymon Grabowski
sgrabow@kis.p.lodz.pl
Instytut Informatyki Stosowanej, Politechnika Łódzka
April 2013
The idea of hashing
Hash table: store keys (and possibly satellite data),
the location is found via a hash function;
some collision resolution method is needed:
• chained hashing
• open addressing
Hash structures are typically fast…
…but have a few drawbacks:
they are randomized,
don’t allow iteration in sorted order,
and may require quite a lot of space.
Now, if we want to use as little space as possible, what can we do?
3
Bloom Filter
(Bloom, 1970)
Don’t store the keys themselves!
Just be able to answer if a given key is in the structure.
If the answer is “no”, it is correct.
But if the answer is “yes”, it may be wrong!
So, it’s a probabilistic data structure.
There’s a tradeoff between its space
(avg space per inserted key) and “truthfulness”.
4
Bloom Filter features
• little space, so it saves RAM (mostly relevant for old apps);
• little space, so the structure is also fast to transfer over a network;
• little space, sometimes small enough to fit in the L2 (or L1!) CPU cache
(Li & Zhong, 2006: a BF makes a Bayesian spam filter work
much faster thanks to fitting in the L2 cache);
• extremely simple / easy to implement;
• major application domains: databases, networking;
• …but also a few drawbacks / issues, hence a significant
interest in devising novel BF variants
5
Bloom Filter idea
The idea:
• keep a bit-vector of some size m, initially all zeros;
• use k independent hash functions (h.f.)
(instead of one, in a standard HT) for each added key;
• write 1 in the k locations pointed to by the k h.f.;
• testing for a key: if all k calculated locations contain a 1,
then return “yes” (= the key exists), which may be wrong;
if among the k locations there’s at least one 0,
return “no”, which is always correct.
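A minimal Python sketch of this idea (the class name, the SHA-256-based double hashing and all parameter values are illustrative choices, not from the slides):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive k pseudo-independent positions from one SHA-256 digest
        # via the double-hashing trick: h_i = (h1 + i*h2) mod m.
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def exists(self, key):
        # "no" is always correct; "yes" may be a false positive.
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter(m=1024, k=7)
for w in ("cat", "dog", "emu"):
    bf.insert(w)
print(bf.exists("cat"))  # True: inserted keys are always found
print(bf.exists("yak"))  # almost surely False (false positives are rare here)
```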
6
BF, basic API
insert(k)
exists(k)
No delete(k)!
And of course: no iteration over the keys
added to the BF (no content listing).
http://www.cl.cam.ac.uk/research/srg/opera/meetings/attachments/2008-10-14-BloomFiltersSurvey.pdf
7
Early applications – spellchecking (1982, 1990),
hyphenation
If a spellchecker occasionally accepts
a word not in its dictionary – not a big problem.
This is exactly what happens with a BF in this app.
Quite a good app: the dictionary is static
(or almost static), so once we set the BF size,
we can estimate the error,
which practically doesn’t change.
App from Bloom’s paper:
program for automatic hyphenation in which 90%
of words can be hyphenated using simple rules,
but 10% require dictionary lookup.
8
Bloom speaking…
http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=7FB6933B782FBC9C98BBCDA0EB420935?doi=10.1.1.20.2080&rep=rep1&type=pdf
9
BF tradeoffs
The error grows with load
(i.e. with growing n / m, where n is the # of added items).
When the BF is almost empty, the error is very small,
but then we also waste lots of space.
Another factor: k. How to choose it?
For any specified load (m set to the ‘expected’ n
in a given scenario) there is an optimal value of k
(one that minimizes the error).
k too small – too many collisions;
k too large – the bit vector gets too ‘dense’ quickly
(and too many collisions, too!)
10
Finding the best k
We assume the hash functions
choose each bit vector slot with equal prob.
Pr(a given bit NOT set by a given h.f.) = 1 – 1/m
Pr(a given bit NOT set) = (1 – 1/m)^kn
m and kn are typically large, so
Pr(a given bit NOT set) ≈ e^(–kn/m)
Pr(a given bit is set) = 1 – (1 – 1/m)^kn
Consider an element not added to the BF:
the filter will lie if all the corresponding k bits are set.
This is: Pr(a given bit is set)^k =
(1 – (1 – 1/m)^kn)^k ≈ (1 – e^(–kn/m))^k
11
Finding the best k, cont’d
Again, Pr(the Bloom filter lies) ≈ (1 – e^(–kn/m))^k.
Clearly, the error grows with growing n (for fixed k, m)
and decreases with growing m (for fixed k, n).
What is the optimal k?
Differentiation (=calculating a derivative) helps.
The error is minimized for k = ln 2 · m/n ≈ 0.693 m/n.
(Then the numbers of 1s and 0s in the bit-vector are
approx. equal. Of course, k must be an integer!)
And the error (false positive rate, FPR) =
(1/2)^k ≈ 0.6185^(m/n).
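These formulas are easy to check numerically; a small sketch (function names are mine):

```python
import math

def optimal_k(m, n):
    # error-minimizing number of hash functions: ln 2 * m / n, rounded
    return max(1, round(math.log(2) * m / n))

def fpr(m, n, k):
    # false positive rate, approx. (1 - e^(-kn/m))^k
    return (1.0 - math.exp(-k * n / m)) ** k

n = 1000
for c in (8, 12, 16):            # c = m / n
    m = c * n
    k = optimal_k(m, n)
    print(f"m = {c}n, k = {k}, FPR = {fpr(m, n, k):.4f}")
```

With an integer k the achieved FPR stays very close to the theoretical minimum (1/2)^(ln 2 · m/n); the printed values agree with the table on the next slide.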
12
Minimizing the error in practice
m = 8n → error ≈ 0.0214
m = 12n → error ≈ 0.0031
m = 16n → error ≈ 0.0005
13
http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html#SECTION00053000000000000000
FPR example, m / n = 8
14
www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt
Funny tricks with BF
Given two BFs, representing sets S1 and S2,
with the same # of bits and using the same hash functions,
we can represent the union of those sets by taking the OR
of the two bit-vectors of the original BFs.
Say you want to halve the memory use after some time
and assume the filter size is a power of 2.
Just OR the halves of the filter.
When hashing for a lookup, OR the lower and upper bits
of the hash value.
Intersection of two BFs (of the same size),
i.e. AND operation, can be used to approximate
the intersection of two sets.
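Both tricks are easy to demonstrate; a sketch with a power-of-two filter size so that halving works (the parameters and helper names are mine):

```python
import hashlib

M = 1 << 10                     # filter size: a power of two, needed for halving
K = 5

def positions(key, m):
    # K pseudo-independent positions via double hashing of a SHA-256 digest
    d = hashlib.sha256(key.encode()).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(K)]

def insert(bits, key):
    for p in positions(key, len(bits)):
        bits[p] = 1

def exists(bits, key):
    return all(bits[p] for p in positions(key, len(bits)))

# Union: bitwise OR of two same-sized filters using the same hash functions.
s1, s2 = bytearray(M), bytearray(M)
insert(s1, "alice"); insert(s2, "bob")
union = bytearray(a | b for a, b in zip(s1, s2))
print(exists(union, "alice"), exists(union, "bob"))  # True True

# Halving: OR the two halves; on lookup, drop the top bit of each position
# (i.e., reduce it modulo the new, halved size).
half = bytearray(union[i] | union[i + M // 2] for i in range(M // 2))
print(all(half[p % (M // 2)] for p in positions("alice", M)))  # True
```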
15
Scalable BF
(Almeida et al., 2007)
We can find the optimal k knowing n / m in advance.
As m is settled once, we must know (roughly) n,
the number of items to add.
What if we have only a vague idea of the size of n..?
If the initial m is too large, we may halve it easily
(see prev. slide). Crude, but possible.
What about m being too small?
Solution: when the filter gets ‘full’ (reaches the limit
on the fill ratio), a new one is added, with a tighter
max FPR, and querying translates to
testing at most all of those filters…
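A toy sketch of the scalable idea (my own simplification: each added filter merely doubles the size; the actual construction of Almeida et al. also tightens each new filter's FPR bound, e.g. via more hash functions):

```python
import hashlib

class ScalableBF:
    def __init__(self, initial_m=256, k=4, fill_limit=0.5):
        self.k, self.fill_limit = k, fill_limit
        self.filters = [bytearray(initial_m)]

    def _positions(self, key, m):
        d = hashlib.sha256(key.encode()).digest()
        return [int.from_bytes(d[4*i:4*i+4], "big") % m for i in range(self.k)]

    def insert(self, key):
        f = self.filters[-1]
        if sum(f) / len(f) > self.fill_limit:   # current filter is 'full':
            f = bytearray(len(f) * 2)           # add a new, larger filter
            self.filters.append(f)
        for p in self._positions(key, len(f)):
            f[p] = 1

    def exists(self, key):
        # a key may live in any of the stacked filters, so test them all
        return any(all(f[p] for p in self._positions(key, len(f)))
                   for f in self.filters)

s = ScalableBF()
for i in range(500):
    s.insert(f"key{i}")
print(len(s.filters), all(s.exists(f"key{i}") for i in range(500)))
```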
16
How to approximate a set
without knowing its size in advance
ε – max allowed false positive rate
Classic result: BF (and some other related structures)
offers (n log(1/ε))-bit solution, when n is known in advance.
Pagh, Segev, Wieder (2013):
17
Semi-join operation in a distributed database
database A (Empl, Salary, Addr, City):
John     60K   …   New York
George   30K   …   New York
Moe      25K   …   Topeka
Alice    70K   …   Chicago
Raul     30K   …   Chicago
database B (City, Cost of living):
New York   60K
Chicago    55K
Topeka     30K
Task: Create a table of all employees that make < 40K
and live in a city where COL > 50K.
Result schema: (Empl, Salary, Addr, City, COL)
Semi-join: send (from A to B) just (City)
Anything better?
www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt
18
Bloom-join
• BF-based solution: A sends a Bloom filter
instead of actual city names,
• …then B sends back its answers…
• …from which A filters out the false positives
This is to minimize transfer (over a network)
between the database sites!
The CPU work is increased:
B needs to filter its city list using the received filter,
A needs to filter its received list of persons.
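A Bloom-join over the toy tables from the previous slide can be simulated in a few lines (filter parameters and helper names are mine):

```python
import hashlib

M, K = 256, 4
def positions(key):
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4*i:4*i+4], "big") % M for i in range(K)]

# Site A: employees; site B: cost of living per city.
employees = [("John", 60, "New York"), ("George", 30, "New York"),
             ("Moe", 25, "Topeka"), ("Alice", 70, "Chicago"),
             ("Raul", 30, "Chicago")]
col = {"New York": 60, "Chicago": 55, "Topeka": 30}

# A: build a filter of cities of employees earning < 40K, "send" it to B.
bf = bytearray(M)
for _, sal, c in employees:
    if sal < 40:
        for p in positions(c):
            bf[p] = 1

# B: answer with expensive cities that pass the filter
# (the answer may contain false positives).
answer = [c for c, cost in col.items()
          if cost > 50 and all(bf[p] for p in positions(c))]

# A: join the answer against its own data; a false-positive city
# matches no low-salary employee, so it drops out naturally.
result = [(e, s, c) for e, s, c in employees if s < 40 and c in answer]
print(result)
```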
19
P2P keyword search
(Reynolds & Vahdat, 2003)
• distributed inverted index on words, multi-word queries,
• Peer A holds list of document IDs containing Word1,
Peer B holds list for Word2,
• intersection needed, but minimize communication,
• A sends B a Bloom filter of document list,
• B sends back possible intersections to A,
• A verifies and sends the true result to user,
• i.e. equivalent to Bloom-join
20
Distributed Web caches (Fan et al., 2000)
(figure: three cooperating proxies – Web Cache 1, Web Cache 2, Web Cache 3)
21
www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt
k-mer counting in bioinformatics
k-mers: substrings of length k in a DNA sequence.
Counting them is important: for genome de novo assemblers
(based on a de Bruijn graph),
for detection of repeated sequences,
to study the mechanisms of sequence duplication
in genomes, etc.
22
http://www.homolog.us/blogs/wp-content/uploads/2011/07/i6.png
BFCounter algorithm
(Melsted & Pritchard, 2011)
The considered problem variant:
find all non-unique k-mers in reads collection, with their counts.
I.e. ignore k-mers with occ = 1 (almost certainly noise).
Input data: 2.66G 36bp Illumina reads (40-fold coverage).
(Output) statistics:
12.18G k-mers (k = 25) present in the sequencing reads,
of which 9.35G are unique
and 2.83G have coverage of two or greater.
23
BFCounter idea
(Melsted & Pritchard, 2011)
Both a Bloom filter and a plain hash table used.
Bloom filter B used to store implicitly all k-mers seen so far,
while only inserting non-unique k-mers into the hash table T.
For each k-mer x, we check if x is in B.
If not, we update the appropriate bits in B,
to indicate that it has now been observed.
If x is in B, then we check if it is in T,
and if not, we add it to T (with freq = 2).
What about false positives?
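The first pass can be sketched as follows (a toy version with made-up parameters and tiny inputs; the real BFCounter works on DNA reads at a much larger scale):

```python
import hashlib

M, K = 1 << 12, 4
bloom = bytearray(M)   # filter B: implicitly stores all k-mers seen so far
counts = {}            # hash table T: only (apparently) non-unique k-mers

def positions(kmer):
    d = hashlib.sha256(kmer.encode()).digest()
    return [int.from_bytes(d[4*i:4*i+4], "big") % M for i in range(K)]

def kmers(read, k):
    return (read[i:i+k] for i in range(len(read) - k + 1))

reads, k = ["ACGTACGT", "ACGTTTTT"], 4
for read in reads:
    for x in kmers(read, k):
        pos = positions(x)
        if all(bloom[p] for p in pos):   # x (probably) seen before
            if x not in counts:
                counts[x] = 2            # first repeat: it occurred twice
            else:
                counts[x] += 1
        else:                            # first sighting: only mark it in B
            for p in pos:
                bloom[p] = 1
print(counts)
```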
24
BFCounter idea, cont’d
(Melsted & Pritchard, 2011)
After the first pass through
the sequence data, one can re-iterate
over the sequence data to obtain exact k-mer counts in T
(and then delete all unique k-mers).
Extra time for this second round: at most 50% of the total time,
and it tends to be less, since hash table lookups are generally
faster than insertions.
Approximate version possible: no re-iteration
(i.e. coverage counts for some k-mers will be higher by 1
than their true value).
25
Memory usage for chr21 (Melsted & Pritchard, 2011)
26
BF, cache access
Negative answer: ½ chance that the first probed bit is 0,
then we terminate (i.e., 1 cache miss – in rare cases 0).
On avg with a negative answer: (almost) 2 cache misses.
Good (and hard to improve).
Positive answer: (almost) k misses on avg.
A problem not really addressed until quite recently…
27
Blocked Bloom filters
(Putze et al., 2007, 2009)
The idea:
first h.f. determines the cache line
(of typical size 64B = 512 bits nowadays),
the next k–1 h.f. are used to set or test bits (as usual)
but only inside this one block.
I.e. (up to) one cache miss always!
Drawback: FPR slightly larger than with plain BF
for the same c := m / n and k.
And the loss grows with growing c…
(even if smaller k is chosen for large c,
which helps somewhat).
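A sketch of the blocked layout (block count, sizes and helper names are mine):

```python
import hashlib

BLOCK_BITS = 512        # one 64-byte cache line
NUM_BLOCKS = 64
K = 5
bits = bytearray(BLOCK_BITS * NUM_BLOCKS)

def positions(key):
    d = hashlib.sha256(key.encode()).digest()
    # The first hash function picks the block (cache line);
    # the remaining K-1 positions stay inside that block.
    block = int.from_bytes(d[:4], "big") % NUM_BLOCKS
    base = block * BLOCK_BITS
    return [base + int.from_bytes(d[4 + 4*i:8 + 4*i], "big") % BLOCK_BITS
            for i in range(K - 1)]

def insert(key):
    for p in positions(key):
        bits[p] = 1

def exists(key):
    # all probes hit one cache line: at most one cache miss per query
    return all(bits[p] for p in positions(key))

insert("cat")
print(exists("cat"))  # True
```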
28
Blocked Bloom filters, cont’d
(Putze et al., 2007, 2009)
I.e. if c < 20 (the top row of the table from the paper), then the space
usually grows by <20%
compared to the plain BF, with comparable FPR.
Unfortunately, for large c (rarely needed?)
the loss is very significant.
The idea of blocking for BF was first suggested
in (Manber & Wu, 1994), for storing the filter on disk.
29
Counting Bloom filter
(Fan et al., 1998, 2000)
BF with delete:
use small counters instead of single bits.
BF[pos]++ at insert, BF[pos]-- at del.
E.g. 4 bits: up to count 15.
Problem: counter overflow
(plain solution: freeze the given counter).
Another (obvious) problem: more space, e.g. 4 times as much.
4-bit counters and k ≤ ln 2 · (m/n) →
probability of overflow ≤ 1.37 · 10^–15 · m
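A counting-filter sketch with overflowing counters frozen at their maximum (parameters are illustrative):

```python
import hashlib

M, K, CAP = 1 << 10, 4, 15      # CAP = 15: what a 4-bit counter can hold
counters = [0] * M              # small counters instead of single bits

def positions(key):
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4*i:4*i+4], "big") % M for i in range(K)]

def insert(key):
    for p in positions(key):
        if counters[p] < CAP:   # freeze an overflowing counter at CAP
            counters[p] += 1

def delete(key):
    for p in positions(key):
        if 0 < counters[p] < CAP:   # frozen counters are never decremented
            counters[p] -= 1

def exists(key):
    return all(counters[p] > 0 for p in positions(key))

insert("cat"); insert("dog")
delete("dog")
print(exists("cat"), exists("dog"))
```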
30
CBF, another problem…
A delete request for a false positive item
(i.e., an incorrect deletion of a non-member)
may produce false negatives!
Problem widely discussed and analyzed in (Guo et al., 2010)
31
Deletable Bloom filter (DlBF)
(Rothenberg et al., 2010)
Cute observation:
those of the k bits for an item x which don’t have a collision
may be safely unset.
If at least one of those k bits is such,
then we’ve managed to delete x!
How to distinguish colliding (overlapping) set bits
from non-colliding ones?
One extra bit per location? Quite costly…
32
Deletable Bloom filter, cont’d
(Rothenberg et al., 2010)
Compromise solution:
divide the bit-vector into small areas;
if no collision happens in an area, then mark it
as a collision-free area.
33
DlBF, deletability prob. as a function of filter density
34
Compressed Bloom filter
(Mitzenmacher, 2002)
If RAM is not an issue, but we want to transmit the filter
over a network…
Mitzenmacher noticed it pays to use more space,
incl. more 0 bits (i.e. the structure is more sparse),
as then the bit-vector becomes compressible.
(In a plain BF the numbers of 0s and 1s are approx. equal
→ practically incompressible.)
m / n increased from 16 to 48: after compression
approx. the same size, but the FPR drops about twofold
35
Conclusions
Bloom Filter is alive and kicking!
Lots of applications and lots of new variants.
In theory: constant FP rate and
constant number of bits per key.
In practice: always think what FP rate you can allow.
Also: what do the errors mean (erroneous results
or “only” increased processing time for false positives)?
Bottom line: succinct data structure for Big Data.
36