The DCBA Scheme

Detecting Duplicates over Sliding Windows with RAM-Efficient Detached Counting Bloom Filter Arrays
Jiansheng Wei†, Hong Jiang‡, Ke Zhou†, Dan Feng†, Hua Wang†
†School of Computer, Huazhong University of Science and Technology, Wuhan, China
Wuhan National Laboratory for Optoelectronics, Wuhan, China
‡Dept. of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA
Background and Motivation
 Duplicate detection is an important technique for monitoring and analysing data streams.
 Recent studies estimate that about 294 billion emails were sent per day worldwide in 2010 [1] and that around 89.1% of them were spam [2].
 If all emails must be scanned for anti-spam, antivirus or homeland security purposes, it is important for the email server to quickly identify duplicates and analyse only unique emails.
 Since real-world data streams can be updated frequently and grow to very large sizes, it is impractical to trace and analyse all the elements in such data streams.
[1] The Radicati Group, Inc. (2010) Email Statistics Report, 2010-2014.
[2] Symantec Corp. (2010) MessageLabs Intelligence: 2010 Annual Security Report.
2011-7-29
 Existing approaches usually employ a decaying window model to evict
stale items and record the most recent elements.
 Landmark Window Model (windows delimited by landmarks i, i+1, i+2, …)
 Jumping Window Model (the window is divided into k equal-sized sub-windows and jumps forward)
 Sliding Window Model (a window of N elements slides forward over the data stream)
The challenge is how to design a membership representation scheme that supports fast search, insertion and deletion of time-ordered elements with low RAM consumption.
To achieve high space efficiency and maintain high query performance, Bloom
filters (BFs) have been widely used to represent membership of static data sets.
A BF for representing a static set S = {e1, e2, …, en} of n elements consists of an
array of m bits and a group of k independent hash functions h1, …, hk with the
range of {1, …, m}.
Fig. 1 Insert elements into a Bloom filter. (Elements x and y each set the k bits selected by h1, …, hk.)
Fig. 2 Query a Bloom filter. (Elements y and z are checked against the bits selected by h1, …, hk.)
 False positive: y is probably a member.
 No false negative: z is definitely not a member.
 A BF faces challenges when dealing with a dynamic set because it does not support element deletion.
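The insert and query operations of Figs. 1 and 2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class name `BloomFilter` and the use of salted SHA-256 digests to emulate k independent hash functions are assumptions, and bit positions are 0-based instead of ranging over {1, …, m}.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over an array of m bits with k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability only

    def _positions(self, item: str):
        # Assumption: k hash functions are emulated by salting SHA-256 with i.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def insert(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item: str) -> bool:
        # True means "probably a member"; False means "definitely not a member".
        return all(self.bits[p] for p in self._positions(item))


bf = BloomFilter(m=1024, k=4)
bf.insert("x")
bf.insert("y")
```

Because bits are only ever set and never cleared, an inserted element always queries positive, which is also why a plain BF cannot delete an element without risking false negatives for others that share its bits.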
Outline
 Background and Motivation
 The DCBA Scheme
 Analysis and Evaluation
 Conclusions
The DCBA Scheme
 A detached counting Bloom filter array (DCBA) consists of an array of detached counting Bloom filters (DCBFs) that are homogeneous and share the same hash functions.
 A DCBF is a Bloom filter (BF) with each of its bits
being associated with a counter, which functions as a
timer to record the timestamp of the represented
element.
 All the timers of a DCBF are further grouped into a
timer array (TA) to improve the access efficiency.
 A window of size N slides over a data stream, and all the monitored elements are represented by an array of g DCBFs, each with a capacity of N/(g−1) elements.
Fig. 3 Using DCBA to represent a sliding window in a single node. (RAM holds BF1 … BFg together with the timer arrays of the decaying and filling DCBFs; the other TAs can be offloaded to, and loaded from, flash or disk.)
 The g DCBFs logically function as a circular FIFO queue.
 A fully filled DCBF is retained for queries only until its elements become stale. Its timer array (TA), which can consume a large amount of memory, may optionally be offloaded to hard disk or flash storage to save RAM.
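The circular FIFO behaviour can be sketched as a ring of g slots. This is a simplified illustration under stated assumptions: plain Python sets stand in for DCBFs, the class name `DcbfRing` is hypothetical, and per-element expiry inside the decaying DCBF (which relies on the timers) is not modelled, so a whole slot is recycled at once.

```python
class DcbfRing:
    """Ring of g slots; plain sets stand in for DCBFs (an assumption)."""

    def __init__(self, g: int, capacity: int) -> None:
        self.g = g
        self.capacity = capacity            # N/(g-1) elements per DCBF
        self.slots = [set() for _ in range(g)]
        self.counts = [0] * g               # number of inserts per slot
        self.filling = 0                    # index of the filling DCBF

    @property
    def decaying(self) -> int:
        # The oldest DCBF sits right after the filling one in the ring.
        return (self.filling + 1) % self.g

    def insert(self, item) -> None:
        if self.counts[self.filling] == self.capacity:
            # The filling DCBF is full: recycle the oldest (decaying) slot.
            self.filling = self.decaying
            self.slots[self.filling].clear()
            self.counts[self.filling] = 0
        self.slots[self.filling].add(item)
        self.counts[self.filling] += 1

    def contains(self, item) -> bool:
        return any(item in s for s in self.slots)


ring = DcbfRing(g=4, capacity=2)
for i in range(9):
    ring.insert(i)
```

After nine inserts the slot that held elements 0 and 1 has been recycled to receive element 8, while elements 2 to 8 are still represented.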
A DCBA can also be split and maintained by r nodes, where each node holds a
group of g DCBFs. All the r×g DCBFs logically function as a circular FIFO
queue, and there is a filling DCBF and a decaying DCBF to accommodate fresh
elements and evict stale elements respectively.
Fig. 4 Using DCBA to represent a sliding window in a decentralized (clustered) system. (Nodes 1 … r each hold a group of BFs and the associated TAs; the decaying DCBF is on node 1 and the filling DCBF is on node r.)
Bits belonging to different BFs but with the same offset are mapped and stored in
the same bit vector so that they can be read or written simultaneously in a single
memory access.
Considering that k hash functions are used and shared among the DCBFs, the
memory access complexity of querying an item will be O(k) rather than O(k×g).
If a positive is produced by the decaying BF, the associated decaying TA will be
queried to check whether the item has already expired.
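The shared bit-vector layout can be illustrated with a short sketch. It is not the authors' code: the names `SlicedBloomArray` and `positions` are hypothetical, Python integers stand in for g-bit machine words, and salted SHA-256 stands in for the k shared hash functions.

```python
import hashlib


def positions(item: str, k: int, m: int):
    # Assumption: salted SHA-256 stands in for the k shared hash functions.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


class SlicedBloomArray:
    """m bit vectors of g bits each; bit j of vector p is bit p of BF j."""

    def __init__(self, g: int, m: int, k: int) -> None:
        self.g, self.m, self.k = g, m, k
        self.vectors = [0] * m  # each int is used as a g-bit word

    def insert(self, item: str, bf_index: int) -> None:
        for p in positions(item, self.k, self.m):
            self.vectors[p] |= 1 << bf_index

    def query(self, item: str) -> int:
        # k word reads in total: AND the k vectors selected by the hashes.
        hits = (1 << self.g) - 1
        for p in positions(item, self.k, self.m):
            hits &= self.vectors[p]
        return hits  # bit j set => BF j reports "probably present"


arr = SlicedBloomArray(g=8, m=4096, k=4)
arr.insert("a", bf_index=2)
arr.insert("b", bf_index=5)
```

A single query touches k words regardless of g, which is the O(k) versus O(k×g) difference described above; the returned mask tells which BFs report a positive.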
Fig. 5 The in-memory structure of a DCBA. (Bit vector 1 to bit vector m each hold the bits of BF1 … BFg at one offset; TA1 and TAg hold d-bit timers with values from 0 to $2^{d}-1$; the offsets are selected by the hash functions hi(·).)
 The bit width d of each timer is determined by the capacity of the DCBA. If the capacity of a DCBA is N, each DCBF is designed to hold N/(g−1) elements, and each timer contains $d = \lceil \log_2 (N/(g-1)) \rceil$ bits to count from 0 to N/(g−1)−1, where 0 denotes the oldest timestamp.
 E.g., if a DCBF is designed to accommodate 1M ($2^{20}$) elements, then each of its timers will consume 20 bits to count from 0 to $2^{20}-1$.
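As a worked check of the timer width, the following sketch evaluates d with integer arithmetic. The function name `timer_bits` is hypothetical, and rounding N/(g−1) up to a whole number of elements is an assumption (it does not change d, since a power of two is at least N/(g−1) exactly when it is at least the rounded-up value).

```python
def timer_bits(n: int, g: int) -> int:
    """Bits per timer, d = ceil(log2(N/(g-1))), using integer arithmetic."""
    capacity = -(-n // (g - 1))          # ceil(N/(g-1)); rounding up is an assumption
    return (capacity - 1).bit_length()   # equals ceil(log2(capacity)) for capacity >= 1


# A DCBF holding 1M = 2**20 elements, e.g. N = 15 * 2**20 with g = 16.
d = timer_bits(15 * 2**20, 16)
```

With these inputs each DCBF holds exactly 2**20 elements and `d` evaluates to 20, in line with the example above.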
 The DCBA scheme maintains a base timer that counts from 0 to
N/(g−1)−1 in a circular manner to generate timestamps for the monitored
elements.
 To insert an element x, k bits in the filling BF will be chosen and set to 1
according to the hash functions hi(x) (1≤i≤k), and the k associated timers
in the filling TA will be set to the value of the base timer.
 The base timer will be incremented by 1 after an insert operation.
 An element in the decaying DCBF will be considered expired once its
timestamp becomes smaller than the base timer.
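The insertion and expiry rules can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `FillingSketch`, `positions` and `is_expired` are hypothetical names, salted SHA-256 stands in for the shared hash functions, and the rotation of DCBFs when the base timer wraps around is not modelled.

```python
import hashlib


def positions(item: str, k: int, m: int):
    # Assumption: salted SHA-256 stands in for the shared hash functions h_i.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


class FillingSketch:
    """Filling BF, filling TA and the base timer (hypothetical names)."""

    def __init__(self, m: int, k: int, capacity: int) -> None:
        self.m, self.k = m, k
        self.capacity = capacity      # N/(g-1) elements per DCBF
        self.bits = [0] * m           # filling BF
        self.timers = [0] * m         # filling TA
        self.base_timer = 0           # counts 0 .. capacity-1 circularly

    def insert(self, item: str) -> None:
        for p in positions(item, self.k, self.m):
            self.bits[p] = 1
            self.timers[p] = self.base_timer
        self.base_timer = (self.base_timer + 1) % self.capacity


def is_expired(timestamp: int, base_timer: int) -> bool:
    # Rule for the decaying DCBF: stale once the timestamp is below the base timer.
    return timestamp < base_timer


f = FillingSketch(m=1024, k=3, capacity=4)
f.insert("a")
f.insert("b")
```

After two inserts the base timer is 2, so elements of the decaying DCBF with timestamps 0 and 1 are treated as expired, while those with timestamps 2 and 3 are still inside the window.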
 Since a bit and its associated timer can be shared by multiple elements in a DCBF, we determine the timestamp of an element according to a count-min policy.
 The minimal value t among all the k timers corresponding to an element x will be considered as its timestamp.
 The probability that all the k timers corresponding to x happen to be shared with, and overwritten by, other elements with larger timestamps than x is very small and can be bounded by restricting the target error rate of each DCBF.
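A small worked example of the count-min policy, using a hypothetical timer array and hypothetical positions (none of these values come from the slides):

```python
def element_timestamp(timers, positions):
    """Count-min policy: the smallest of the k timers is the element's timestamp."""
    return min(timers[p] for p in positions)


# Hypothetical timer array: x was inserted at time 3 into positions 1, 4 and 6;
# a later element (time 9) then overwrote the shared timers at positions 4 and 6.
timers = [0, 3, 0, 0, 9, 0, 9, 0]
t_x = element_timestamp(timers, [1, 4, 6])

# If every timer of an element had been overwritten, the estimate would be too large.
t_all_shared = element_timestamp(timers, [4, 6])
```

Here `t_x` is 3 because one unshared timer is enough to recover the true timestamp, whereas `t_all_shared` is 9, which is the rare overestimation case described above.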
Analysis and Evaluation
 RAM Consumption
 If the overall false positive rate of the DCBA is constrained to $\varepsilon_{DCBA}$, the error rate threshold of each DCBF should be $\varepsilon_{DCBF} = 1 - (1 - \varepsilon_{DCBA})^{1/g}$.
 The optimal number of hash functions can be derived as $k_{DCBF} = \lceil \log_2 (1/\varepsilon_{DCBF}) \rceil$.
 The total space requirement of a DCBA is expressed as $m_{DCBA\text{-}total} = g \times (d+1) \times \lceil \log_2 e \times k_{DCBF} \times N/(g-1) \rceil$.
 Since the DCBA scheme allows up to g−2 TAs to be offloaded to disks, the minimal RAM consumption of a DCBA is therefore $m_{DCBA\text{-}RAM} = (g + 2d) \times \lceil \log_2 e \times k_{DCBF} \times N/(g-1) \rceil$.
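The sizing formulas can be evaluated with a short script. The function name `dcba_parameters` and the returned keys are hypothetical; sizes are assumed to be in bits, since m denotes the number of bits in a BF, and `expm1`/`log1p` are used only to evaluate the per-DCBF error rate accurately for small ε.

```python
import math


def dcba_parameters(n: int, g: int, eps_dcba: float) -> dict:
    """Evaluate the sizing formulas; all sizes are assumed to be in bits."""
    # eps_dcbf = 1 - (1 - eps_dcba)**(1/g), computed stably for small eps.
    eps_dcbf = -math.expm1(math.log1p(-eps_dcba) / g)
    k = math.ceil(math.log2(1.0 / eps_dcbf))
    d = math.ceil(math.log2(n / (g - 1)))
    m = math.ceil(math.log2(math.e) * k * n / (g - 1))  # bits per DCBF
    return {
        "eps_dcbf": eps_dcbf,
        "k": k,
        "d": d,
        "total_bits": g * (d + 1) * m,   # g BFs plus g TAs
        "ram_bits": (g + 2 * d) * m,     # g BFs plus 2 TAs kept in RAM
    }


p = dcba_parameters(n=32 * 2**20, g=16, eps_dcba=2**-7)
```

For ε = 1/2^7 the script yields k = 11, 12 and 13 for g = 16, 32 and 64, and for ε = 1/2^14 it yields k = 18, 19 and 20, which agrees with the DCBA-16, DCBA-32 and DCBA-64 legends of Fig. 7.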
Fig. 6 Memory consumption of a DCBA: RAM consumption (MB) of DCBA-RAM, DCBA-total, TBF and BF versus the number of DCBFs (logarithmic scale), with plot annotations for $\varepsilon = 1/2^{7}$ and $\varepsilon = 1/2^{14}$.
(a) Representing a sliding window with size 64M (1M = $2^{20}$).
(b) Representing a sliding window with size 1,024M (1M = $2^{20}$).
 We have collected a sequence of 2,026,005,927 chunk
fingerprints that contains 83,733,597 unique elements, which
provides a real-world data set for measuring the performance
and query accuracy of the DCBA scheme.
 The experimental server that maintains a DCBA for
monitoring the chunk fingerprint stream over a sliding
window is configured as follows:
 a 32-bit Windows operating system
 a quad-core CPU running at 2 GHz
 4×2GB RAM
 16×1TB hard disks organized as a RAID-5 partition
 2 gigabit network interface cards.
 Query Performance
Fig. 7 Average query performance of a DCBA: query performance (fingerprints/second) versus the number of processed unique fingerprints (M, 1M = $2^{20}$), for N = 32M.
(a) Representing a sliding window with an error rate threshold of $1/2^{7}$ (TBF, k=7; DCBA-16, k=11; DCBA-32, k=12; DCBA-64, k=13).
(b) Representing a sliding window with an error rate threshold of $1/2^{14}$ (TBF, k=14; DCBA-16, k=18; DCBA-32, k=19; DCBA-64, k=20).
 Query Accuracy
Fig. 8 False positive rate of a DCBA: false positive rate versus the number of processed unique fingerprints (M, 1M = $2^{20}$), for N = 32M, with the error rate threshold plotted for reference.
(a) Representing a sliding window with an error rate threshold of $1/2^{7}$ (TBF, k=7; DCBA-16, k=11; DCBA-32, k=12; DCBA-64, k=13).
(b) Representing a sliding window with an error rate threshold of $1/2^{14}$ (TBF, k=14; DCBA-16, k=18; DCBA-32, k=19; DCBA-64, k=20).
Conclusion
 This paper proposes a Detached Counting Bloom filter Array (DCBA) scheme to address the problem of detecting duplicates in data streams over sliding windows.
 High query performance
 High space efficiency
 Scalability
 Easy to synchronize
 Mathematical analysis and experimental results show that
 a DCBA can achieve a high query performance that is comparable to the state-
of-the-art timing Bloom filter approach in the same environment with a much
lower RAM overhead than the latter.
 the actual error rate of a DCBA can be well constrained at or below its
predefined threshold.
 In general, a DCBA outperforms existing schemes and is more flexible in
representing massive stream elements in sliding windows.
Thanks!
Questions?
Acknowledgment
This work is supported in part by the National Basic Research Program (973 Program) of China under Grant No. 2011CB302305, the National High Technology Research and Development Program (863 Program) of China under Grant No. 2009AA01A402, and the US NSF under Grants NSF-IIS-0916859, NSF-CCF-0937993 and NSF-CNS-1016609. The authors are grateful to the anonymous reviewers for their valuable comments and suggestions.