The Bloom Paradox

Ori Rottenstreich
Joint work with Yossi Kanizo and Isaac Keslassy
Technion, Israel
Problem Definition
[Figure: a user issues queries for elements such as x and y. The local cache S holds a subset of the elements (access cost = 1); the central memory M holds all elements (access cost = 10).]
• Requirement: A data structure at the user that quickly answers the membership query "is $x \in S$?"
• Solutions:
o O(n) – Searching in a list
o O(log(n)) – Searching in a sorted list
o O(1) – But with false positives / negatives
Two Possible Errors
• False Positive: $y \notin S$, but the data structure answers $y \in S$.
• Results in a redundant access to the local cache.
→ Additional cost of 1.
• False Negative: $x \in S$, but the data structure answers $x \notin S$.
• Results in an expensive access to the central memory instead of the local cache.
→ Additional cost of 10 - 1 = 9.
Bloom Filters (Bloom, 1970)
• Initialization: Array of $m$ zero bits.
• Insertion: Each of the $n$ elements is hashed $k$ times, and the corresponding bits are set.
• Query: Hash the element and check that all $k$ corresponding bits are set.
[Figure: bit array illustrating insertions and queries for the elements x, y, z, and w.]
• False positive rate (probability) of approximately $(1 - e^{-kn/m})^k$
• No false negatives
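The three operations above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `BloomFilter`, the salted SHA-256 hashing, and the sizes `m=1024`, `k=3` are assumptions made for the example and do not come from the slides.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter with m bits and k hash functions (sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # initialization: array of m zero bits

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index.
        # This hashing scheme is an assumption chosen for the example.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item: str) -> bool:
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Standard approximation (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=1024, k=3)
for element in ("x", "y", "z"):
    bf.insert(element)

# Inserted elements are always reported as present (no false negatives).
print(all(bf.query(e) for e in ("x", "y", "z")))
print(false_positive_rate(m=1024, n=3, k=3))
```

A query for an element that was never inserted usually returns False, but it may return True when all of its positions happen to be set by other elements.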
Bloom Filters are Widely Used
• Cache/Memory Framework
• Packet Classification
• Intrusion Detection
• Routing
• Accounting
• Beyond networking: Spell Checking, DNA Classification
• Can be found in
o Google's web browser Chrome
o Google's database system BigTable
o Facebook's distributed storage system Cassandra
o Mellanox's IB Switch System
Outline
• Introduction to Bloom Filters
• The Bloom Paradox
• The Variable-Increment Counting Bloom Filter
The Bloom Paradox
Sometimes, it is better to disregard the Bloom
filter results, and in fact not to even query it,
thus making the Bloom filter useless.
Example
• Parameters:
• Extreme case without locality: all elements have an equal probability of belonging to the cache.
o Toy example
The Bloom Paradox
• Parameters:
• Let $B$ be the set of elements that the Bloom filter indicates are in $S$.
o In particular, no false negatives → $S \subseteq B$
• Intuition:
[Figure: the user's Bloom filter in front of the local cache S (cost = 1) and the central memory M with all elements (cost = 10); B is the set of elements that the Bloom filter reports as members of S.]
• Surprise: The Bloom filter indicates the membership of $|B|$ elements. Only $|S|$ of them are indeed in $S$.
The Bloom Paradox
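• When the Bloom filter states that $x \in S$, it is wrong with probability $\Pr(x \notin S \mid x \in B) = 1 - |S|/|B|$ (all elements being equally likely).
• Average cost if we listen to the Bloom filter: $1 + 10 \cdot (1 - |S|/|B|)$, which approaches 11 when $|B| \gg |S|$.
• Average cost if we don't: 10.
→ The Bloom filter is useless! Don't listen to the Bloom filter.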
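The comparison above can be checked numerically. The sketch below assumes the cost model of the earlier slides (cache access = 1, central memory access = 10) and a uniform prior over all elements; the set sizes and the false positive rate are hypothetical values chosen for illustration, not the parameters of the talk.

```python
def wrong_probability(n_cache: int, n_total: int, fpr: float) -> float:
    """P(x not in S | Bloom filter says x in S) under a uniform prior."""
    true_positives = n_cache                     # no false negatives
    false_positives = (n_total - n_cache) * fpr  # expected number of them
    return false_positives / (true_positives + false_positives)


def cost_if_listening(p_wrong: float, cache_cost: float = 1.0,
                      memory_cost: float = 10.0) -> float:
    """Average cost of a positive answer: try the cache, then memory on a miss."""
    return cache_cost + p_wrong * memory_cost


# Hypothetical sizes, not taken from the slides.
p_wrong = wrong_probability(n_cache=10**3, n_total=10**8, fpr=0.01)
listen = cost_if_listening(p_wrong)
ignore = 10.0  # always go straight to the central memory

print(f"P(wrong | positive) = {p_wrong:.4f}")
print(f"cost if we listen   = {listen:.2f}")
print(f"cost if we ignore   = {ignore:.2f}")
```

With these costs, listening to a positive answer is worse than ignoring it whenever the filter is wrong with probability above 0.9, which happens as soon as the false positives outnumber the cached elements by more than nine to one.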
Counting Bloom Filters (CBFs)
• Bloom filters do not support deletions of elements. Simply resetting bits might cause false negatives.
[Figure: bit array shared by the elements x and y.]
• The solution: Counting Bloom filters, which store an array of counters instead of bits.
o Insertion: Incrementing counters by one.
o Deletion: Decrementing counters by one.
o Query: Checking that counters are positive.
[Figure: counter array; inserting x and y adds +1 to each of their hashed counters.]
• The same false positive probability.
• Require too much memory, e.g. 57 bits per element for a false positive rate of $10^{-3}$.
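A counting Bloom filter differs from the plain one only in that each position holds a counter. The following is a minimal sketch under the same assumptions as before: the class name and the salted SHA-256 hashing are illustrative choices, and counter overflow is not modeled.

```python
import hashlib


class CountingBloomFilter:
    """Minimal counting Bloom filter with m counters and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item: str):
        # Illustrative hashing scheme: SHA-256 salted with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def delete(self, item: str) -> None:
        # Only valid for an item that was inserted earlier.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def query(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=64, k=3)
cbf.insert("x")
cbf.insert("y")
cbf.delete("x")
print(cbf.query("y"))  # y is still found after x is deleted
```

Because deleting x only decrements counters, any counter shared with y stays positive, which is what resetting bits in a plain Bloom filter cannot guarantee.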
Intuition for Variable Increments
• Upon query, we should consider the exact values of the counters and not just their positiveness.
[Figure: counter array with the values 0, 1, 0, 2, 5, 0, 1, 8, 3, 0, 2, 1, queried for the elements y and z.]
• Can we design a deterministic scheme that exploits the exact values of the counters?
• Idea: Use variable increments to encode the element identity.
Architecture
• Each hash entry contains a pair of counters:
o $c_1$, fixed increments → number of elements in the entry (as in the CBF)
o $c_2$, variable increments → weighted sum of the elements
o weights from a pre-determined set
• We use two sets of hash functions:
o The first set uses hash functions whose range is the set of entry indices, i.e. it points to the set of entries.
o The second set uses hash functions whose range indexes the pre-determined set of weights, i.e. it points to that set.

Entry   1   2   3   4   5   6   7   8   9
c1      0   5   3   2   2   3   3   3   4
c2      0  34  25  26  17  21   9   6  26
Insertion
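• Insertion: At each entry that the element hashes to, the two counters are updated as follows.
o $c_1$ is incremented by one.
o $c_2$ is incremented by a variable increment taken from the pre-determined set of weights.
• Example 1:
[Figure: counters c1 and c2 of entries 1 to 9 before and after inserting x and z; the variable increments shown are +4, +8, and +13.]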
Query
• Query: is $y \in S$?

Entry   1   2   3   4   5   6   7   8   9
c1      0   5   3   2   3   3   4   3   4
c2      0  34  25  17  30  21  30  13  26

y hashes to entry 4 with variable increment 4 and to entry 5 with variable increment 8.
• We ask whether
o 17 can be a sum of 2 elements from the set, including 4
o 30 can be a sum of 3 elements from the set, including 8
• No, so the filter can answer that $y \notin S$.
• How should we pick the set of variable increments? We should use $B_h$ sequences!
Bh Sequences
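• Definition 1: Let $D = \{v_1, v_2, \ldots, v_\ell\}$ be a sequence of positive integers. Then, $D$ is a $B_h$ sequence iff all the sums $v_{i_1} + v_{i_2} + \cdots + v_{i_h}$ with $1 \le i_1 \le i_2 \le \cdots \le i_h \le \ell$ are distinct.
• Example 2: All the sums of $h$ elements of the example sequence are distinct. Therefore, it is a $B_h$ sequence.
• $B_h$ sequences are widely used in error-correcting codes.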
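The definition can be checked by brute force for small sequences. In the sketch below, the function name `is_bh_sequence` and the two test sequences are illustrative choices and are not taken from the slides; the first sequence is a prefix of the well-known Mian-Chowla sequence.

```python
from itertools import combinations_with_replacement


def is_bh_sequence(values, h):
    """Return True iff all sums of h elements (repetition allowed) are distinct."""
    seen = set()
    for combo in combinations_with_replacement(sorted(values), h):
        total = sum(combo)
        if total in seen:
            return False  # two different multisets share the same sum
        seen.add(total)
    return True


print(is_bh_sequence([1, 2, 4, 8, 13], 2))  # True
print(is_bh_sequence([1, 2, 3, 4], 2))      # False, since 1 + 4 == 2 + 3
```

The check enumerates every multiset of size h, so it is only practical for short sequences and small h, which is the regime of interest for per-entry counters.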
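The Bh-CBF Scheme: Query
• Example 3: the set of variable increments is a $B_h$ sequence.

Entry   1   2   3   4   5   6   7   8   9
c1      0   5   3   2   3   3   4   3   4
c2      0  34  25  17  30  21  30  13  26

o Query for X (variable increments 1 and 4): the counters of its entries are inconsistent with these increments, so the Bh-CBF can determine that $X \notin S$.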
o Query for y (variable increments 4 and 8): the Bh-CBF can determine that $y \notin S$.
o Query for z: the counters of its entries are consistent with its variable increments, so the Bh-CBF cannot exclude that $z \in S$.
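The per-entry test behind these queries can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the function name `entry_may_contain`, the weight set `WEIGHTS`, the bound `H = 2`, and the counter values in the usage lines are all assumptions chosen for the example. An entry with more than h elements is treated like a plain counting Bloom filter entry.

```python
from itertools import combinations_with_replacement

WEIGHTS = (1, 2, 4, 8, 13)  # hypothetical B_2 sequence of variable increments
H = 2                        # entries with at most H elements can be decoded


def entry_may_contain(c1, c2, increment, weights=WEIGHTS, h=H):
    """Can an entry with counters (c1, c2) hold an element with this increment?"""
    if c1 == 0:
        return False  # empty entry: the element is definitely absent
    if c1 > h:
        return True   # too many elements to decode; behave like a plain CBF
    rest = c2 - increment
    # The other c1 - 1 elements must account for the remaining weighted sum.
    return any(sum(combo) == rest
               for combo in combinations_with_replacement(weights, c1 - 1))


def query(entries):
    """entries: one (c1, c2, increment) triple per hash function of the element."""
    return all(entry_may_contain(c1, c2, inc) for c1, c2, inc in entries)


print(entry_may_contain(2, 9, 1))       # 9 = 1 + 8, so membership is possible
print(entry_may_contain(2, 9, 4))       # 9 - 4 = 5 is not a weight, so it is excluded
print(query([(2, 9, 1), (1, 8, 4)]))    # the second entry holds only weight 8
```

The B_h property guarantees that an entry with at most h elements decodes to a single multiset of increments, so in practice the search can be replaced by a precomputed lookup from (c1, c2) to that multiset.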
Experimental Results
• Internet trace (equinix-chicago) with real hash functions.
For the Bh-CBF,
(with
).
Concluding Remarks
• The Bloom Paradox
o Discovery of the Bloom paradox
o Importance of the a priori membership probability
• The Variable-Increment Counting Bloom Filter
o Can extend many variants of the counting Bloom filter
o First time $B_h$ sequences are presented in networking applications
Thank You