ADC2010_DistinctObje.. - CSE - University of New South Wales

Counting Distinct Objects over
Sliding Windows
Presented by:
Muhammad Aamir Cheema
Joint work with
Wenjie Zhang, Ying Zhang and Xuemin Lin
University of New South Wales, Australia
Introduction
Counting distinct objects:
 Given a dataset D, return the number of distinct objects
in D.
Counting distinct objects against sliding windows:
 Given a data stream, return the number of distinct
objects that arrive at or after timestamp t.
Applications
 traffic management, call centers, wireless
communication, stock market etc.
Introduction
Approximate counting:
Let n be the actual number of distinct objects and n’ be
the reported answer. Build a sketch such that every query is
answered with the following guarantee:
|n-n’|/n ≤ ε with confidence (1 – δ)
Contribution:
 FM-based algorithms
   SE-FM (accuracy guarantee + space usage guarantee)
   PCSA-based algorithm (no accuracy guarantee, although practical; more efficient)
 k-Skyband (accuracy guarantee + efficient + no space usage guarantee)
FM Algorithm
FM SKETCH
Let h(x) be a uniform hash function
 Let the “pivot” p(x) be the position of the leftmost 1-bit of h(x)
 Let FM be an array of size k initialized to zero
 For each record x in the dataset
   FM[pivot] = 1
 Let B = FMmin be the position of the leftmost 0-bit of FM
 Number of distinct elements = α · 2^B, where α = 1.2897385
Each bit i of h(x) has 1/2 probability to be one
Example (k = 4), stream: r1, r2, r1, r3, r1
  h(r1) = 1 0 1 0
  h(r2) = 0 0 1 0
  h(r3) = 1 1 0 1
  FM    = 1 0 1 0
  FMmin = 1
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base
applications. JCSS 1985
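A minimal Python sketch of the single-sketch procedure, run on the slide's own example. The hash values are copied from the example rather than computed, and the helper names (`pivot`, `build_fm`, `fm_min`, `estimate`) are illustrative, not taken from the original work.

```python
ALPHA = 1.2897385  # constant quoted on the slide

# Hash values from the slide example (k = 4), leftmost bit first.
H = {"r1": [1, 0, 1, 0], "r2": [0, 0, 1, 0], "r3": [1, 1, 0, 1]}

def pivot(bits):
    """Position of the leftmost 1-bit, or None if every bit is 0."""
    return bits.index(1) if 1 in bits else None

def build_fm(stream, k=4):
    """Set FM[pivot] = 1 for every record in the stream."""
    fm = [0] * k
    for x in stream:
        p = pivot(H[x])
        if p is not None:
            fm[p] = 1
    return fm

def fm_min(fm):
    """Position of the leftmost 0-bit (len(fm) if the sketch is full)."""
    return fm.index(0) if 0 in fm else len(fm)

def estimate(fm):
    return ALPHA * 2 ** fm_min(fm)

fm = build_fm(["r1", "r2", "r1", "r3", "r1"])
print(fm, fm_min(fm), estimate(fm))  # [1, 0, 1, 0] 1 2.579477
```

Repeated arrivals of r1 leave the sketch unchanged, which is why the sketch counts distinct objects rather than records.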
FM Algorithm
Each bit i of h(x) has probability 1/2 of being one
 An h(x) whose first i bits are zero and whose (i+1)-th bit is one occurs with probability 1/2^(i+1)
Let n be the number of distinct elements
 FM[0] is accessed approx. n/2 times
 FM[1] is accessed approx. n/4 times
 …
 FM[i] is accessed approx. n/2^(i+1) times
If i >> log_2 n
 FM[i] will almost certainly be zero
If i << log_2 n
 FM[i] will almost certainly be one
If i ≈ log_2 n
 FM[i] may be zero or one
Hence, the first i for which FM[i] is zero may be
used to approximate the number of distinct elements
n.
Example: stream r1, r2, r1, r3, r1 with h(r1) = 1010, h(r2) = 0010, h(r3) = 1101
gives FM = 1010, so FMmin = 1
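The three cases can be checked numerically. Assuming a uniform hash that treats distinct elements independently, one element sets FM[i] with probability 1/2^(i+1), so FM[i] is still zero after n distinct elements with probability (1 - 1/2^(i+1))^n. The function name `prob_zero` and the choice n = 1024 are illustrative.

```python
def prob_zero(n, i):
    """Probability that FM[i] is still 0 after n distinct elements,
    assuming each element hits position i with probability 1/2^(i+1)."""
    return (1 - 2 ** -(i + 1)) ** n

n = 1024  # log_2 n = 10
for i in (2, 9, 20):
    print(i, prob_zero(n, i))
```

For i = 2 the probability is vanishingly small, for i = 20 it is close to one, and for i = 9 it sits in between, which matches the three regimes listed on the slide.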
FM Algorithm
Use r hash functions to create r FM sketches
 Initialize each FM sketch to zero
 For each record x in the dataset
   For each hash function hi(x)
     FMi[pivot] = 1
 Let Bi be the position of the leftmost 0-bit of FMi
 B = (B1 + B2 + … + Br) / r
 Number of distinct elements = α · 2^B, where α = 1.2897385
Example (r = 3): B1 = 1, B2 = 2, B3 = 2, so B = (1 + 2 + 2)/3 ≈ 1.67
Performance Guarantee: Let n be the actual number of
distinct objects, n’ be the reported answer and m be the
size of the element domain. Then
P( |n’ - n|/n ≤ ε ) ≥ 1 - δ
if n > 1/ε, k = O(log m + log(1/ε) + log(1/δ)) and r = O((1/ε^2) log(1/δ))
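A sketch of the r-sketch variant in Python. Salted SHA-256 stands in for the r independent uniform hash functions, which is an assumption of this example; the defaults r = 64 and k = 32 and all function names are likewise illustrative.

```python
import hashlib

ALPHA = 1.2897385  # constant quoted on the slide

def pivot(x, seed, k):
    """Leftmost 1-bit position of a k-bit hash of x (k <= 64), or None if
    all k bits are 0. SHA-256 salted with `seed` is an assumed stand-in
    for the i-th uniform hash function."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    value = int.from_bytes(digest[:8], "big") >> (64 - k)
    return k - value.bit_length() if value else None

def leftmost_zero(fm):
    return fm.index(0) if 0 in fm else len(fm)

def combine(bs):
    """Average the per-sketch positions B_i, then scale by ALPHA * 2^B."""
    return ALPHA * 2 ** (sum(bs) / len(bs))

def fm_estimate(stream, r=64, k=32):
    sketches = [[0] * k for _ in range(r)]
    for x in stream:
        for i, fm in enumerate(sketches):
            p = pivot(x, i, k)
            if p is not None:
                fm[p] = 1
    return combine([leftmost_zero(fm) for fm in sketches])

print(combine([1, 2, 2]))        # the slide example, B = 5/3
print(fm_estimate(range(1000)))  # should land in the vicinity of 1000
```

Averaging the positions B_i before exponentiating reduces the variance of a single sketch, whose estimate can only take values of the form α · 2^B with integer B.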
FM-based Algorithm
Maintaining one FM sketch
 For each record (x,t) in dataset
 FM[pivot] = t;
Answering a query
 For any t, let B = FMmin(t) be the position of the
leftmost entry of FM with value less than t
 Number of distinct elements that arrived at or after
t = α · 2^B, where α = 1.2897385
Example (k = 4), stream of (record, timestamp): (r1, 1), (r2, 2), (r3, 3), (r2, 4), (r2, 5)
  h(r1) = 1 0 1 0
  h(r2) = 0 0 1 0
  h(r3) = 1 1 0 1
  FM    = 3 0 5 0
  FMmin(4) = 0
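A minimal Python sketch of the timestamped variant, replayed on the slide example. Hash values are copied from the example; the class name `WindowFM` and the convention that 0 means "never set" (so timestamps start at 1) are assumptions of this illustration.

```python
ALPHA = 1.2897385  # constant quoted on the slide

# Hash values from the slide example (k = 4), leftmost bit first.
H = {"r1": [1, 0, 1, 0], "r2": [0, 0, 1, 0], "r3": [1, 1, 0, 1]}

class WindowFM:
    def __init__(self, k):
        self.fm = [0] * k  # 0 = never set; timestamps are assumed to start at 1

    def update(self, x, t):
        bits = H[x]
        if 1 in bits:
            # Timestamps only grow, so the latest arrival overwrites the entry.
            self.fm[bits.index(1)] = t

    def fm_min(self, t):
        """Leftmost position whose stored timestamp is smaller than t."""
        for i, ts in enumerate(self.fm):
            if ts < t:
                return i
        return len(self.fm)

    def estimate(self, t):
        return ALPHA * 2 ** self.fm_min(t)

sketch = WindowFM(k=4)
for t, x in enumerate(["r1", "r2", "r3", "r2", "r2"], start=1):
    sketch.update(x, t)
print(sketch.fm, sketch.fm_min(4))  # [3, 0, 5, 0] 0
```

An entry with timestamp at least t plays the role of a 1-bit for the window starting at t, so one array answers queries for every t.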
FM-based Algorithm
Maintain r FM sketches
 Initialize each FM sketch to zero
 For each record (x,t) in the dataset
   For each hash function hi(x)
     FMi[pivot] = t
Answering a query
 For any t, let Bi(t) be the position of the leftmost entry smaller
than t in the i-th FM sketch
 Let B = (B1(t) + B2(t) + … + Br(t)) / r
 Number of distinct elements that arrived at or after t = α · 2^B,
where α = 1.2897385
Performance Analysis
Let n be the actual number of distinct objects arriving not before time t, n’ be the reported
answer and m be the size of the element domain. Then
P( |n’ - n|/n ≤ ε ) ≥ 1 - δ
if n > 1/ε, k = O(log m + log(1/ε) + log(1/δ)) and r = O((1/ε^2) log(1/δ))
Total space: O((1/ε^2) log(1/δ) log m)
Total maintenance cost for one record: O((1/ε^2) log(1/δ) log log m)
Total query cost: O((1/ε^2) log(1/δ) log log m)
PCSA-based Algorithm
Maintain r FM sketches but update only j < r sketches per record
 Generate j hash functions H(x) that map x to [1,r]
 Initialize each FM sketch to zero
 For each record (x,t) in the dataset
   For each of the j hash functions H()
     i = H(x)
     Update the i-th FM sketch
Answering a query
 For any t, let Bi(t) be the position of the leftmost entry smaller than t in the i-th FM sketch
 Let B = (B1(t) + B2(t) + … + Br(t)) / r
 Number of distinct elements that arrived at or after t = (α · 2^B) / j,
where α = 1.2897385
Inspired by the PCSA technique in “P. Flajolet and G. N. Martin. Probabilistic counting
algorithms for data base applications. JCSS 1985”
NOTE: No accuracy guarantee, but performs well in practice
BJKST Algorithm
Main Idea
• Let h() be a hash function that maps D to [1, m^3], where m = |D|
• For each record x, generate its hash value h(x)
• Maintain the k-th smallest distinct hash value k_min
Number of distinct elements = n = k · m^3 / k_min
Improved algorithm
• Use r hash functions
• Compute ni for each hash function hi() as above
• Report the final answer as the median of the ni values
Performance guarantee:
P( |n’ - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log(1/δ))
Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in
a data stream. In RANDOM'02.
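The k-th-smallest-hash idea can be sketched in Python as follows. Salted SHA-256 reduced modulo M = m^3 is an assumed stand-in for the hash functions, and returning the exact count when fewer than k distinct hash values were seen is a convenience of this example, not something stated on the slide. All names are illustrative.

```python
import hashlib
import heapq
import statistics

def h(x, seed, M):
    """Assumed hash into [1, M], built from salted SHA-256."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:16], "big") % M + 1

def kmin_estimate(stream, k, M, seed=0):
    heap = []        # negated values: a max-heap of the k smallest hashes
    members = set()  # hash values currently in the heap
    for x in stream:
        v = h(x, seed, M)
        if v in members:
            continue                      # duplicate hash value, ignore
        if len(heap) < k:
            heapq.heappush(heap, -v)
            members.add(v)
        elif v < -heap[0]:
            evicted = -heapq.heappushpop(heap, -v)
            members.discard(evicted)
            members.add(v)
    if len(heap) < k:
        return float(len(heap))           # fewer than k distinct values seen
    return k * M / -heap[0]               # n = k * m^3 / k_min

def bjkst_estimate(stream, k, r, M):
    """Median of r independent estimates, one per hash function."""
    items = list(stream)
    return statistics.median(kmin_estimate(items, k, M, seed=s) for s in range(r))

est = bjkst_estimate(range(5000), k=256, r=5, M=10**12)
print(est)  # should be close to 5000
```

Only the k smallest distinct hash values are kept per hash function, so memory is O(k) regardless of the stream length.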
K-Skyband Technique
Main Idea
• Let h() be a hash function that maps D to [1, m^3], where m = |D|
• For each record (x,t’), generate h(x) and store the record (x, h(x), t’)
Answering a query q(t):
• Retrieve all records (x, h(x), t’) with timestamp t’ ≥ t
• Get the k-th smallest distinct hash value and apply the BJKST algorithm
Limitation: Requires storing all records
K-Skyband Technique
For any time t, we need to find the k-th smallest hash value
among records arriving at or after t
A record x dominates another record y if x arrives after y
and has a smaller hash value
The k-skyband keeps only the records that are dominated
by at most (k-1) records
Maintaining the k-skyband:
• Keep a counter for each record
• When a new element (x,t) arrives, increment the counter of all records dominated by it
• Remove the records whose counter is at least k
[Figure: records a to e plotted by arrival time t against hash value h(x), with k = 2]
We increment the counters of groups to improve
efficiency (domination aggregation search tree)
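The counter-based maintenance can be sketched in Python with a plain linear scan; the grouped updates of the domination aggregation search tree are not reproduced here. Two assumptions of this example go beyond the slide: timestamps are strictly increasing, and when an object re-arrives its old record is replaced and only records newer than the old copy are incremented, so each counter counts distinct dominating hash values. The class name `KSkyband` is illustrative.

```python
class KSkyband:
    def __init__(self, k):
        self.k = k
        self.records = {}  # hash value -> [timestamp, counter]

    def insert(self, hv, t):
        """Add a record with hash value hv arriving at time t
        (t is assumed larger than every earlier timestamp)."""
        old = self.records.pop(hv, None)
        old_t = old[0] if old else float("-inf")
        for other in list(self.records):
            rec = self.records[other]
            # The new record dominates `other` if it has a smaller hash.
            # Records older than a previous copy of hv already counted it.
            if other > hv and rec[0] > old_t:
                rec[1] += 1
                if rec[1] >= self.k:
                    del self.records[other]
        self.records[hv] = [t, 0]

sky = KSkyband(k=2)
for t, hv in enumerate([5, 3, 4, 1], start=1):
    sky.insert(hv, t)
print(sorted(sky.records))  # [1, 3, 4]
sky.insert(2, 5)
print(sorted(sky.records))  # [1, 2]
```

The record with hash 5 is dropped once two later records with smaller hashes (3 and 4) have arrived, since it can never again be among the 2 smallest hash values of any suffix of the stream.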
K-Skyband Technique
Answering Query:
Find k_min (the k-th smallest hash value among
elements arriving at or after t)
• Let z be the number of stored elements that arrived before t
• k_min is the (z+k)-th overall smallest hash value
Algorithm:
• Maintain a binary search tree eT that stores elements according to t
• Maintain a binary search tree eH that stores elements according to h(x)
When a query q(t) arrives
• Compute z by using eT
• Find the (z+k)-th overall smallest hash value from eH
[Figure: records a to f plotted by arrival time t against hash value h(x); with k = 2 and z = 3, k_min is the 5th smallest h(x)]
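The query step can be sketched in Python as below. Sorted lists with `bisect` stand in for the search trees eT and eH, so this version re-sorts on every query instead of achieving the logarithmic cost of the trees. The input is assumed to be a valid k-skyband given as (hash value, timestamp) pairs, M denotes m^3, and the function names are illustrative.

```python
import bisect

def k_min(skyband, k, t):
    """k-th smallest hash among records arriving at or after t,
    or None if fewer than k such records exist."""
    times = sorted(ts for _, ts in skyband)    # stands in for eT
    hashes = sorted(hv for hv, _ in skyband)   # stands in for eH
    z = bisect.bisect_left(times, t)           # records that arrived before t
    idx = z + k - 1                            # (z+k)-th smallest, 0-based
    return hashes[idx] if idx < len(hashes) else None

def estimate(skyband, k, t, M):
    km = k_min(skyband, k, t)
    if km is None:
        # Fewer than k records at or after t: all of them are stored.
        return float(sum(1 for _, ts in skyband if ts >= t))
    return k * M / km

# A 2-skyband of (hash value, timestamp) pairs.
skyband = [(3, 2), (4, 3), (1, 4)]
print(k_min(skyband, 2, 3))  # 4
print(k_min(skyband, 2, 2))  # 3
```

The shortcut works because every stored record that arrived before t has a hash value smaller than k_min; otherwise it would be dominated by k records and would already have been removed.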
Performance Analysis
Let n be the actual number of distinct objects arriving not before time t, n’ be the reported
answer and m be the size of the element domain. Then
P( |n’ - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log(1/δ))
Expected total space: O((1/ε^2) log(1/δ) log n)
Expected time complexity: O(log(1/δ) (log(1/ε) + log n))
Experiments
• Synthetic datasets following Uniform and Zipf distributions
• Real dataset: WorldCup 98 HTTP requests (20 M records)
[Figures: Space Efficiency]
[Figure: Time Efficiency, maintenance cost]
[Figure: Time Efficiency, query response time]
[Figure: Accuracy]
Thanks

P. B. Gibbons. Distinct sampling for highly-accurate answers to
distinct values queries and event reports. In VLDB, 2001.
Space usage: (1/ε^2) log(1/δ) m^(1/2)
Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatiotemporal aggregation using sketches. In ICDE 2004.
Space usage: O((N/ε^2) log(1/δ) log m)
Space Requirement (SE-FM)
To guarantee the performance we require the following:
 k = O(log m + log(1/ε) + log(1/δ))
 r = O((1/ε^2) log(1/δ))
Let m > 1/ε and m > 1/δ; then k = O(log m)
Size of one sketch is k = O(log m)
Size of r sketches is O(r log m) = O((1/ε^2) log(1/δ) log m)
Total space: O((1/ε^2) log(1/δ) log m)
Time Complexity (SE-FM)
To guarantee the performance we require the following:
 k = O(log m + log(1/ε) + log(1/δ))
 r = O((1/ε^2) log(1/δ))
The elements in a sketch are stored in a min-heap to support logarithmic search/update
 Hence, the cost of one search/update operation is O(log k) = O(log log m)
To maintain the sketches, we update r sketches for each record x
 Total maintenance cost for one record: O(r log log m) = O((1/ε^2) log(1/δ) log log m)
To answer a query, we search in r sketches
 Total cost: O(r log log m) = O((1/ε^2) log(1/δ) log log m)
Space Usage (K-Skyband)
Performance guarantee:
P( |n’ - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log(1/δ))
Expected size of one k-skyband = O(k ln(n/k))
Expected size of r k-skybands = O(r k log(n/k)) = O((1/ε^2) log(1/δ) log n)
Time Complexity (K-Skyband)
Performance guarantee:
P( |n’ - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log(1/δ))
Answering Query q(t):
Search eT to compute z: O(log(k log n)) = O(log k + log n)
Search eH to find the (z+k)-th element: O(log k + log n)
We require this for all r sketches: O(r (log k + log n)) = O(log(1/δ) (log(1/ε) + log n))