Ori Rottenstreich
Joint work with Isaac Keslassy
Technion , Israel
Problem Definition
[Figure: the user reaches the local cache S (holding x, z) at cost 1; accesses to the central memory M (holding all elements x, y, z, u, v) cost 10.]
• Requirement: A data structure at the user with a fast answer to the membership query: is a given element in S?
• Solutions:
o O(n) – Searching in a list
o O(log(n)) – Searching in a sorted list
o O(1) – But with false positives / negatives
Two Possible Errors
• False Positive: the element (e.g. y) is not in S, but the data structure answers that it is.
• Results in a redundant access to the local cache. Additional cost of 1.
• False Negative: the element (e.g. x) is in S, but the data structure answers that it is not.
• Results in an expensive access to the central memory instead of the local cache. Additional cost of 10 - 1 = 9.
Bloom Filters (Bloom, 1970)
• Initialization: Array of m zero bits.
• Insertion: Each of the n elements is hashed k times, and the corresponding bits are set.
• Query: Hashing the element k times, checking that all the corresponding bits are set.
[Figure: bit array with elements x and y inserted and elements z and w queried.]
• Nonzero false positive rate (probability).
• No false negatives.
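The three operations above can be sketched in a few lines of Python. This is a minimal illustration, not the speakers' implementation: the class name, the SHA-256 salting used to obtain k hash functions, and the helper `approx_false_positive_rate` are all assumptions of this sketch. The helper uses the standard textbook approximation (1 - e^(-kn/m))^k, which is not necessarily the exact expression shown on the slide.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter with m bits and k hash functions (illustrative)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # initialization: array of m zero bits

    def _positions(self, item: str):
        # Assumption of this sketch: k hash functions are obtained by
        # salting SHA-256 with the hash index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        # Set the k bits the element hashes to.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item: str) -> bool:
        # Positive answer only if all k bits are set.
        return all(self.bits[pos] for pos in self._positions(item))


def approx_false_positive_rate(n: int, m: int, k: int) -> float:
    """Standard approximation (1 - e^(-kn/m))^k for n inserted elements."""
    return (1.0 - math.exp(-k * n / m)) ** k
```

An inserted element always sets all of its own bits, so a later query for it cannot fail; a positive answer for an element that was never inserted is a false positive.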
Bloom Filters are Widely Used
• Cache/Memory Framework
• Packet Classification
• Intrusion Detection
• Routing
• Accounting
• Beyond networking: Spell Checking, DNA Classification
• Can be found in
o Google's web browser Chrome
o Google's database system BigTable
o Facebook's distributed storage system Cassandra
o Mellanox's IB Switch System
The Bloom Paradox
Sometimes, it is better to disregard the Bloom filter results, and in fact not to even query it, thus making the Bloom filter useless.
Outline
Introduction to Bloom Filters
The Bloom Paradox
o The Bloom Paradox in Bloom Filters
o Analysis of the Bloom Paradox
o The Bloom Paradox in the Counting Bloom Filter
Summary
Bloom Paradox Example
Bloom filter
• Parameters:
• Extreme case without locality: All elements with equal probability of belonging to the cache.
o Toy example
Bloom Paradox Example
• Parameters:
• Let B be the set of elements that the Bloom filter indicates are in S
o In particular, S is a subset of B: no false negatives in the Bloom filter
• Intuition:
[Figure: the local cache S (holding x, z; cost 1) and the central memory M with all elements x, y, z, u, v (cost 10).]
• Surprise:
[Figure: the set B indicated by the Bloom filter, together with the local cache S (cost 1) and the central memory M with all elements (cost 10).]
The Bloom filter indicates the membership of all |B| elements of B, but only |S| of them are indeed in S.
Bloom Paradox Example
• When the Bloom filter states that an element is in S, it is wrong with probability (|B| - |S|) / |B|
• Average cost if we listen to the Bloom filter: 1 + 10 · (|B| - |S|) / |B|
• Average cost if we don't: 10
The Bloom filter is useless!
Don't listen to the Bloom filter
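The comparison can be made concrete with a short calculation. The numeric parameters did not survive on the slides above, so the values used in the usage note below (10^4 cached elements out of 10^10, false positive rate 10^-3) are hypothetical and chosen only for illustration. The function name and argument names are also assumptions of this sketch. The cost model follows the figure: a cache access costs 1 and a central memory access costs 10.

```python
def average_costs(num_in_cache, num_total, false_positive_rate,
                  cache_cost=1.0, memory_cost=10.0):
    """Compare trusting a positive Bloom filter answer with ignoring it.

    Assumes all elements are equally likely to be in the cache and that
    the filter has no false negatives.
    """
    # Expected number of elements outside the cache that the filter
    # nevertheless reports as members.
    false_positives = (num_total - num_in_cache) * false_positive_rate
    indicated = num_in_cache + false_positives  # expected size of B
    p_wrong = false_positives / indicated

    # Listening: always pay the cache access, then the memory access
    # whenever the positive answer was wrong.
    cost_listen = cache_cost + p_wrong * memory_cost
    # Ignoring the filter: go straight to the central memory.
    cost_ignore = memory_cost
    return p_wrong, cost_listen, cost_ignore
```

With the hypothetical values `average_costs(10**4, 10**10, 1e-3)`, almost every positive answer is wrong, so listening to the filter costs more on average than going directly to the central memory.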
Costs of the Two Possible Errors
• The cost of a false positive: 1
• The cost of a false negative: application-dependent
• In the cache example: a false negative costs 10 - 1 = 9
Conditions for the Bloom Paradox
• Let Pr(x ∈ S) be the a priori membership probability of an element x
o i.e. before getting the answer of the Bloom filter
• Intuition: The Bloom paradox occurs more often when:
o the a priori membership probability is small
o the false positive rate is large (i.e. the memory per element is small)
o the cost of a false negative is small (because the Bloom filter implicitly assumes it is large)
[Figure: local cache, Bloom filter, central memory]
• Theorem 1 gives a necessary and sufficient condition for the Bloom paradox to occur.
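The exact statement of Theorem 1 is not recoverable from the text above. The following is therefore only a sketch of the standard Bayesian argument behind such a condition, not the authors' formula. The function names, the default false positive cost of 1, and the decision rule (compare the expected cost of trusting a positive answer with the expected cost of ignoring it) are assumptions of this sketch.

```python
def posterior_membership(prior, false_positive_rate):
    """P(x in S | Bloom filter answers positively), by Bayes' rule.

    Uses the fact that a Bloom filter has no false negatives, so a
    member always yields a positive answer.
    """
    return prior / (prior + (1.0 - prior) * false_positive_rate)


def paradox_occurs(prior, false_positive_rate, fn_cost, fp_cost=1.0):
    """True if ignoring a positive answer is cheaper on average."""
    post = posterior_membership(prior, false_positive_rate)
    # Trusting the positive answer risks a false positive with
    # probability (1 - post); ignoring it risks a false negative
    # with probability post.
    return (1.0 - post) * fp_cost > post * fn_cost
```

For example, `paradox_occurs(1e-6, 1e-3, 9)` holds, while `paradox_occurs(0.5, 1e-3, 9)` does not: with a very small a priori probability, even a positive answer leaves membership unlikely.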
Bloom Filter Improvements
• Theorem 1: a necessary and sufficient condition for the Bloom paradox
• Use the formula to improve the Bloom filter
o Only insert into / query the Bloom filter if the formula expects it to be useful
[Figure: local cache, Bloom filter, central memory]
Counting Bloom Filters (CBFs)
• Bloom filters do not support deletions of elements. Simply resetting bits might cause false negatives.
[Figure: bit array with elements x and y inserted.]
• The solution: Counting Bloom filters - Storing array of counters instead of bits.
o Insertion: Incrementing counters by one.
o Deletion: Decrementing counters by one.
o Query: Checking that counters are positive.
[Figure: counter array after inserting x and y; one counter holds the value 2.]
• The same false positive probability.
• Require too much memory, e.g. 57 bits per element.
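A counting Bloom filter differs from the plain filter only in replacing bits with counters. The sketch below is illustrative: the class name and the SHA-256 salting used to obtain k hash functions are assumptions. Deletion is only safe for elements that were actually inserted, which the sketch does not check.

```python
import hashlib


class CountingBloomFilter:
    """Minimal counting Bloom filter with m counters and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m  # array of counters instead of bits

    def _positions(self, item: str):
        # Assumption of this sketch: k hash functions obtained by
        # salting SHA-256 with the hash index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1  # increment by one

    def delete(self, item: str) -> None:
        # Caller must ensure the item was previously inserted.
        for pos in self._positions(item):
            self.counters[pos] -= 1  # decrement by one

    def query(self, item: str) -> bool:
        # Positive answer only if all k counters are positive.
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```

Because a shared counter is only decremented, not reset, deleting one element leaves the counters of the remaining elements positive, so deletion does not introduce false negatives.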
Counting Bloom Filter Query
• Query
o Checking that counters are positive.
[Figure: counter array 0 1 0 2 5 0 1 8 3 0 2 1, queried for elements y and z.]
o Question: Which is more likely to be correct? y or z?
The Bloom Paradox in the Counting Bloom Filter
• Theorem 2:
Consider the values of the counters pointed to by the set of hash functions. The a posteriori membership probability depends on these values only through their product.
Only the counters product matters!
CBF Based Membership Probability
• Before checking the CBF: a priori membership probability ≈ 0.03
• CBF indicates counters product = 8: a posteriori membership probability ≈ 0.69
• Parameters: n = 3328, m = 28485, k = 6
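One way to see why only the product matters is a simple likelihood-ratio calculation. The sketch below is not the authors' Theorem 2, whose formula is not recoverable from the text above. It assumes each counter is approximately Poisson with mean nk/m, so that observing value c is c / (nk/m) times more likely when the queried element contributes one unit to that counter. The function name and this Poisson approximation are assumptions.

```python
import math


def cbf_posterior(prior, counters, n, m):
    """Approximate P(x in S | observed counter values) for a CBF.

    Assumption: counters are independent Poisson variables with mean
    n * k / m, where k = len(counters).
    """
    k = len(counters)
    load = n * k / m  # expected value of a counter
    # Per-counter likelihood ratio is c / load, so the overall ratio
    # depends on the counters only through their product.
    likelihood_ratio = math.prod(counters) / load ** k
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)
```

With the slide's parameters, `cbf_posterior(0.03, [1, 1, 1, 2, 2, 2], 3328, 28485)` (counters product 8) gives a value in the same range as the ≈ 0.69 reported above; the a priori value 0.03 is itself rounded, so an exact match is not expected. Any other counter values with product 8 give the same result.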
Experimental Results
• Internet trace (equinix-chicago) with real hash functions.
Counting Bloom filter parameters: n = 2^10, m / n = 30, k = 5, 2^20 queries
Concluding Remarks
• Discovery of the Bloom paradox
• Importance of the a priori membership probability
• Using the counters product to estimate the correctness of a positive indication of the CBF