Efficient Computation of Frequent and
Top-k Elements in Data Streams
Ahmed Metwally
Divyakant Agrawal
Amr El Abbadi
Department of Computer Science
University of California, Santa Barbara
Motivation

Motivated by Internet advertising commissioners
Before rendering an advertisement for a user, query the click stream to decide which advertisements to display.
If the user's profile is not a frequent “clicker”, then s/he will probably not click any displayed advertisement.
– Show Pay-Per-Impression advertisements.
If the user's profile is a frequent “clicker”, then s/he may click a displayed advertisement.
– Show Pay-Per-Click advertisements.
– Retrieve the top advertisements to choose what to display.
Problem Definition
Given an alphabet A and a stream S of size N, a frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN
Top-k elements are the k elements with the highest frequency
Both problems:
– Are closely related, though no integrated solution has been proposed
– An exact solution needs O(min(N, |A|)) space, which motivates approximate variations
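To make the two definitions concrete, here is a minimal exact baseline in Python. It keeps one counter per distinct element, which is the O(min(N, |A|)) space cost the approximate variations try to avoid. The function names `exact_frequent` and `exact_top_k` and the sample stream are illustrative choices, not part of the talk.

```python
from collections import Counter


def exact_frequent(stream, phi):
    """Return the elements whose exact frequency exceeds phi * N."""
    counts = Counter(stream)          # one counter per distinct element
    n = sum(counts.values())          # stream size N
    return {e for e, f in counts.items() if f > phi * n}


def exact_top_k(stream, k):
    """Return the k (element, frequency) pairs with the highest frequency."""
    return Counter(stream).most_common(k)


stream = "ABBACABBDDBEC"              # N = 13
print(exact_frequent(stream, 0.3))    # support is 0.3 * 13 = 3.9 hits -> {'B'}
print(exact_top_k(stream, 2))         # [('B', 5), ('A', 3)]
```

Both queries read the same table of counts, which is one way to see why the two problems are closely related.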
Practical Frequent Elements

ε-Deficient Frequent Elements [Manku ’02]:
– All frequent elements output should have F > (φ - ε)N, where ε is the user-defined error.
[Figure: element frequencies with thresholds marked at φN and (φ - ε)N]
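The condition above can be checked against exact counts when they are available, for example in a test harness. The sketch below does only that; the helper name `is_epsilon_deficient` and the sample values are assumptions made for illustration.

```python
from collections import Counter


def is_epsilon_deficient(output, stream, phi, eps):
    """Check that every reported element has true frequency F > (phi - eps) * N."""
    counts = Counter(stream)
    n = len(stream)
    return all(counts[e] > (phi - eps) * n for e in output)


stream = "ABBACABBDDBEC"   # A: 3, B: 5, C: 2, D: 2, E: 1
# With phi = 0.3 and eps = 0.1 the relaxed threshold is 0.2 * 13 = 2.6 hits.
print(is_epsilon_deficient({"A", "B"}, stream, 0.3, 0.1))   # True
print(is_epsilon_deficient({"B", "C"}, stream, 0.3, 0.1))   # False, C has 2 hits
```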
Practical Top-k

FindApproxTop(S, k, ε) [Charikar ’02]:
– Retrieve a list of k elements such that every element, Ei, in the list has Fi > (1 - ε) Fk, where Ek is the kth ranked element.
[Figure: element frequencies with thresholds marked at F4 and (1 - ε) F4]
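As a hedged illustration of the FindApproxTop condition, the following sketch validates a candidate list against exact counts. The name `satisfies_find_approx_top` is hypothetical, and the check assumes the stream has at least k distinct elements and the output contains no repeats.

```python
from collections import Counter


def satisfies_find_approx_top(output, stream, k, eps):
    """Check that output has k elements, each with F_i > (1 - eps) * F_k."""
    counts = Counter(stream)
    ranked = sorted(counts.values(), reverse=True)
    f_k = ranked[k - 1]               # frequency of the k-th ranked element
    return len(output) == k and all(counts[e] > (1 - eps) * f_k for e in output)


stream = "ABBACABBDDBEC"   # A: 3, B: 5, C: 2, D: 2, E: 1, so F_2 = 3
# With eps = 0.4 every reported element needs more than 0.6 * 3 = 1.8 hits.
print(satisfies_find_approx_top(["B", "C"], stream, 2, 0.4))   # True
print(satisfies_find_approx_top(["B", "E"], stream, 2, 0.4))   # False
```

Note that C is accepted even though it is not in the exact top-2; that slack is what the relaxation permits.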
Related Work

Algorithms Classification
– Counter-Based techniques
• Keep an individual counter for each monitored element
• If the observed ID is monitored, its counter is updated
• If the observed ID is not monitored, the action is algorithm-dependent
– Sketch-Based techniques
• Estimate frequency for all elements using bit-maps of counters
• Each element is hashed into the counters’ space using a family of hash functions
• Hashed-to counters are queried for the frequencies
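For contrast with the counter-based family, here is a minimal sketch-based estimator in the style of a Count Sketch: several rows of signed counters, one hashed bucket per row, and a median over the per-row estimates. It is a simplified illustration, not the construction analyzed in [Charikar ’02]; the salted SHA-256 hashing, the default sizes, and the class name are assumptions.

```python
import hashlib
from statistics import median


class CountSketch:
    """Minimal Count Sketch: `depth` rows of `width` signed counters."""

    def __init__(self, depth=5, width=64):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, item, row):
        # Assumption: a salted digest stands in for a proper hash family.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.width
        sign = 1 if digest[8] & 1 else -1
        return bucket, sign

    def add(self, item):
        """Update the hashed-to counter in every row."""
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(item, row)
            self.table[row][bucket] += sign

    def estimate(self, item):
        """Query the hashed-to counters; the median damps collisions."""
        estimates = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(item, row)
            estimates.append(sign * self.table[row][bucket])
        return median(estimates)


sketch = CountSketch()
for _ in range(5):
    sketch.add("A")
print(sketch.estimate("A"))   # 5, since nothing else collides with A
```

Unlike a counter-based summary, the sketch can be queried for any element, at the cost of collision noise once many distinct elements share the counters.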
Recent Work (Comparison)
Algorithm                            Nature   Space Bound                                       Handles
CountSketch [Charikar ’02]           Sketch   O(k/ε^2 log(N/δ)), δ is the failure probability   FindApproxTop(S, k, ε)
GroupTest [Cormode ’03]              Sketch   O(φ^-1 log(φ^-1) log(|A|))                        Hot Items
Frequent [Demaine ’02]               Counter  O(1/ε), proved by [Bose ’03]                      FE (Frequent Elements)
Probabilistic-Inplace [Demaine ’02]  Counter  O(m), m is the available memory                   FindCandidateTop(S, k, m/2)
Lossy Counting [Manku ’02]           Counter  (1/ε) log(εN)                                     ε-Deficient FE
Sticky Sampling [Manku ’02]          Counter  (2/ε) log(φ^-1 δ^-1)                              ε-Deficient FE
Outline
Problem Definition
Space-Saving: Summarizing the Data Stream
Answering Frequent Elements Queries
Answering Top-k Queries
Experimental Results
Conclusion
The Space-Saving Algorithm
Space-Saving is counter-based
Monitor only m elements
Only over-estimation errors
Frequency estimation is more accurate for significant elements
Keep track of maximum possible errors
Space-Saving By Example
[Animated example: the stream S = ABBACABBDDBEC is processed one element at a time while a table of Element, Count, and error (max possible) is updated.]
Space-Saving Algorithm
– For every element in the stream S
– If a monitored element is observed
• Increment its Count
– If a non-monitored element is observed,
• Replace the element with minimum hits, min
• Increment the minimum Count to min + 1
• Set error to the maximum possible over-estimation, min
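The update rule can be sketched in a few lines of Python. This is a minimal version under stated assumptions: it uses plain dictionaries and a linear scan for the minimum, so eviction costs O(m) rather than constant time, and it assumes that an unused counter is simply assigned to a new element with Count 1 and error 0. The function name `space_saving` is illustrative.

```python
def space_saving(stream, m):
    """Summarize `stream` with at most m counters.

    Returns {element: (count, error)}, where error is the maximum
    possible over-estimation of that element's count.
    """
    counts = {}
    errors = {}
    for item in stream:
        if item in counts:
            counts[item] += 1                      # monitored: increment its Count
        elif len(counts) < m:
            counts[item] = 1                       # assumption: fill a free counter
            errors[item] = 0
        else:
            victim = min(counts, key=counts.get)   # element with minimum hits
            min_count = counts.pop(victim)
            errors.pop(victim)
            counts[item] = min_count + 1           # Count becomes min + 1
            errors[item] = min_count               # error is the old minimum
    return {e: (counts[e], errors[e]) for e in counts}


summary = space_saving("ABBACABBDDBEC", m=3)
print(summary)   # {'B': (5, 0), 'E': (4, 3), 'C': (4, 3)}
```

With three counters the run ends with B at Count 5 and error 0, and E and C each at Count 4 and error 3. Ties for the minimum are broken arbitrarily here; on this stream either choice leads to the same final summary.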
Space-Saving Observations
S = ABBACABBDDBEC, N = 13

Observations:
– The summation of the Counts is N
– Minimum number of hits, min ≤ N/m
– In this example, min = 4
– The minimum number of hits, min, is an upper bound on the error of any element

Element               B  E  C
Count                 5  4  4
error (max possible)  0  3  3
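The bound min ≤ N/m follows directly from the first observation. Every arrival raises the total of the Counts by exactly one: either a monitored Count is incremented, or a Count of min is replaced by min + 1. Once all m counters are in use, each Count is at least min, so:

```latex
\begin{align*}
\sum_{i=1}^{m} \mathrm{Count}_i &= N \\
m \cdot \mathit{min} &\le \sum_{i=1}^{m} \mathrm{Count}_i = N \\
\mathit{min} &\le \frac{N}{m}
\end{align*}
```

In the example the summary holds three elements, so taking m = 3 (inferred from the table rather than stated on the slide) gives min = 4 ≤ 13/3 ≈ 4.33.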
Space-Saving Proved Properties
S = ABBACABBDDBEC, N = 13
1. If element E has frequency F > min, then E must be in Stream-Summary. Example: F(B) = F_1 = 5, min = 4.
2. The Count at position i in Stream-Summary is no less than F_i, the frequency of the ith ranked element. Example: F(A) = F_2 = 3, Count_2 = 4.

Element               B  E  C
Count                 5  4  4
error (max possible)  0  3  3
Space-Saving Data Structure

We need a data structure that
– Increments counters in constant time
– Keeps elements sorted by their counters
We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02]
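One way to picture such a structure is to group elements into buckets that share the same Count, so an increment only moves an element to the neighboring bucket. The sketch below is a simplification and not the Stream-Summary itself: it keeps buckets in a dictionary and sorts the distinct Count values at query time, whereas a linked list of buckets would keep them ordered at all times. The class and method names are assumptions.

```python
class BucketSummary:
    """Counters grouped into buckets of equal Count (simplified sketch)."""

    def __init__(self):
        self.count_of = {}    # element -> Count
        self.buckets = {}     # Count -> set of elements with that Count
        self.min_count = 0    # smallest Count present (0 when empty)

    def insert(self, element, count):
        """Start monitoring `element` with the given Count."""
        self.count_of[element] = count
        self.buckets.setdefault(count, set()).add(element)
        if self.min_count == 0 or count < self.min_count:
            self.min_count = count

    def increment(self, element):
        """Move `element` from bucket c to bucket c + 1 in constant time."""
        c = self.count_of[element]
        self.buckets[c].remove(element)
        if not self.buckets[c]:
            del self.buckets[c]
            if c == self.min_count:
                self.min_count = c + 1   # the moved element now holds the minimum
        self.count_of[element] = c + 1
        self.buckets.setdefault(c + 1, set()).add(element)

    def replace_min(self, new_element):
        """Evict one minimum element, monitor new_element, return the old min."""
        c = self.min_count
        victim = self.buckets[c].pop()
        del self.count_of[victim]
        self.buckets[c].add(new_element)
        self.count_of[new_element] = c
        self.increment(new_element)
        return c

    def sorted_elements(self):
        """Elements by descending Count; only distinct Counts are sorted."""
        return [(e, c) for c in sorted(self.buckets, reverse=True)
                for e in sorted(self.buckets[c])]


s = BucketSummary()
for x in "ABBACABBDDBEC":
    if x in s.count_of:
        s.increment(x)
    elif len(s.count_of) < 3:
        s.insert(x, 1)
    else:
        s.replace_min(x)
print(s.sorted_elements())   # [('B', 5), ('C', 4), ('E', 4)]
```

The value returned by `replace_min` is the old minimum, which is what a caller would store as the new element's error.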
Frequent Elements Queries
Traverse Stream-Summary, and report all elements that satisfy the user support
Any element whose guaranteed hits = (Count – error) > φN is guaranteed to be a frequent element
Frequent Elements Example
Element           B   D   G   A   Q   F   C   E
Count            20  14  12   9   7   5   3   3
error             1   0   4   1   3   0   1   2
Guaranteed Hits  19  14   8   8   4   5   2   1
(Guaranteed Hits = Count - error)

For N = 73, m = 8, φ = 0.15:
– Frequent Elements need more than φN = 10.95 hits, that is, a support of 11 hits.
– Candidate Frequent Elements are B, D, and G.
– Guaranteed Frequent Elements are B and D, since their guaranteed hits exceed the support.
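The query in this example can be reproduced with a short traversal. The sketch below assumes the summary is available as a list of (element, Count, error) tuples; the function name `frequent_elements` is illustrative, and the data are the Count and error values from the example.

```python
def frequent_elements(summary, n, phi):
    """Split monitored elements into candidate and guaranteed frequent ones.

    `summary` is a list of (element, count, error) tuples.
    """
    support = phi * n
    candidates = [e for e, count, _ in summary if count > support]
    guaranteed = [e for e, count, error in summary if count - error > support]
    return candidates, guaranteed


summary = [("B", 20, 1), ("D", 14, 0), ("G", 12, 4), ("A", 9, 1),
           ("Q", 7, 3), ("F", 5, 0), ("C", 3, 1), ("E", 3, 2)]
candidates, guaranteed = frequent_elements(summary, n=73, phi=0.15)
print(candidates)   # ['B', 'D', 'G']
print(guaranteed)   # ['B', 'D']
```

G is only a candidate: its Count of 12 clears the support, but its guaranteed hits, 12 - 4 = 8, do not.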
Frequent Elements Space Bounds
Space Bounds     General Distribution           Zipf(α)
Space-Saving     O(1/ε)                         (1/ε)^(1/α)
GroupTest        O(φ^-1 log(φ^-1) log(|A|))
Frequent         O(1/ε), proved by [Bose ’03]
Lossy Counting   (1/ε) log(εN)
Sticky Sampling  (2/ε) log(φ^-1 δ^-1)
Top-k Elements Queries

Traverse the Stream-Summary, and report the top-k elements.
From Property 2, we assert:
– Guaranteed top-k elements:
• Any element whose guaranteed hits = (Count – error) ≥ Count_{k+1} is guaranteed to be in the top-k.
– Guaranteed top-k’ (where k’ ≈ k):
• The top-k’ elements reported are guaranteed to be the correct top-k’ iff for every element in the top-k’, guaranteed hits = (Count – error) ≥ Count_{k’+1}.
Top-k Elements Example
Element           B   D   G   A   Q   F   C   E
Count            20  14  12   9   7   5   3   3
error             1   0   4   1   3   0   1   2
Guaranteed Hits  19  14   8   8   4   5   2   1
(Guaranteed Hits = Count - error)

For k = 3, m = 8:
– B, D, and G are the top-3 candidates.
– B and D are guaranteed to be in the top-3.
– B, D, G, and A are guaranteed to be the top-4. Here k’ = 4.
– B and D are guaranteed to be the top-2. Here k’ = 2 is another valid choice.
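The three statements in this example follow from comparing guaranteed hits against the Count in position k + 1. The sketch below assumes the summary is a list of (element, Count, error) tuples already sorted by descending Count and holding more than k entries; it only examines the first k positions, and the function name `top_k_guarantees` is illustrative.

```python
def top_k_guarantees(summary, k):
    """Return (candidates, guaranteed, all_guaranteed) for the first k entries.

    `summary` is a list of (element, count, error) sorted by descending count.
    """
    next_count = summary[k][1]                 # Count_{k+1}
    top = summary[:k]
    candidates = [e for e, _, _ in top]
    guaranteed = [e for e, c, err in top if c - err >= next_count]
    all_guaranteed = len(guaranteed) == k      # the reported set is the correct top-k
    return candidates, guaranteed, all_guaranteed


summary = [("B", 20, 1), ("D", 14, 0), ("G", 12, 4), ("A", 9, 1),
           ("Q", 7, 3), ("F", 5, 0), ("C", 3, 1), ("E", 3, 2)]
print(top_k_guarantees(summary, 3))   # (['B', 'D', 'G'], ['B', 'D'], False)
print(top_k_guarantees(summary, 4))   # (['B', 'D', 'G', 'A'], ['B', 'D', 'G', 'A'], True)
print(top_k_guarantees(summary, 2))   # (['B', 'D'], ['B', 'D'], True)
```

For k = 3 the threshold is Count_4 = 9, which G's 8 guaranteed hits miss; for k’ = 4 the threshold drops to Count_5 = 7, which all four elements clear.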
Top-k Elements Space Bounds
Space Bounds  General Distribution                         Zipf(α)
Space-Saving  FindApproxTop(S, k, ε): O(k/ε * log(N))      Exact Top-k Problem: α = 1: O(k^2 log(|A|)); α > 1: O((k/α)^(1/α) k)
CountSketch   FindApproxTop(S, k, ε): O(k/ε^2 * log(N/δ))  FindApproxTop(S, k, ε): α ≥ 1: O(k * log(N/δ))
Experimental Results - Setup



Synthetic data:
– Zipf(α), α varied: 0.0, 0.5, 1.0, …, 2.5, 3.0
– N = 10^7 hits.
Real Data (ValueClick, Inc.): Similar results
Precision:
– number of correct elements found / entire output
Recall:
– number of correct elements found / number of actual correct elements
Run time:
– Processing Stream + Query Time
Space used:
– Including hash table
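The two quality metrics can be written down as a short helper. This is a sketch of the definitions above, not the evaluation code used in the experiments; the function name `precision_recall` and the convention of returning 1.0 for an empty denominator are assumptions.

```python
def precision_recall(reported, correct):
    """Return (precision, recall) for a reported set against the true answer."""
    reported, correct = set(reported), set(correct)
    hits = len(reported & correct)                         # correct elements found
    precision = hits / len(reported) if reported else 1.0  # assumption for empty output
    recall = hits / len(correct) if correct else 1.0       # assumption for empty answer
    return precision, recall


# Three elements reported, two of them actually correct, none missed.
print(precision_recall({"B", "D", "G"}, {"B", "D"}))   # (0.6666666666666666, 1.0)
```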
Frequent Elements Results
Query: φ = 10^-2, ε = 10^-4, and δ = 10^-2
We compared with
– GroupTest and Frequent
All algorithms had a recall of 1.
– That is, every correct element appeared in their output.
Space-Saving was able to guarantee all its output to be correct
Frequent Elements Precision
[Figure: Precision for Frequent Elements (>100,000 Hits) on Synthetic Data. Series: Space-Saving, GroupTest, Frequent. x-axis: Zipf Alpha (0 to 3); y-axis: Precision.]
Frequent Elements Run Time
[Figure: Run Time for Frequent Elements (>100,000 Hits) on Synthetic Data. Series: Space-Saving, GroupTest, Frequent. x-axis: Zipf Alpha (0 to 3); y-axis: Run Time (ms).]
Frequent Elements Space Used
[Figure: Space Used for Frequent Elements (>100,000 Hits) on Synthetic Data. Series: Space-Saving, GroupTest, Frequent. x-axis: Zipf Alpha (0 to 3); y-axis: Space Used (Bytes).]
Top-k Elements Results
Query: k = 100, ε = 10^-4, and δ = 10^-2
We compared with
– CountSketch: CountSketch was re-run several times. The hidden constant was estimated to be 16, in order to have output of competitive quality.
– Probabilistic-InPlace: was allowed the same number of counters as Space-Saving
Space-Saving was able to guarantee all its output to be correct
Top-k Elements Precision
[Figure: Precision for Top-100 on Synthetic Data. Series: Space-Saving, CountSketch, Probabilistic InPlace. x-axis: Zipf Alpha (0 to 3); y-axis: Precision.]
Top-k Elements Recall
[Figure: Recall for Top-100 on Synthetic Data. Series: Space-Saving, CountSketch, Probabilistic InPlace. x-axis: Zipf Alpha (0 to 3); y-axis: Recall.]
Top-k Elements Run Time
[Figure: Run Time for Top-100 on Synthetic Data. Series: Space-Saving, CountSketch, Probabilistic InPlace. x-axis: Zipf Alpha (0 to 3); y-axis: Run Time (ms).]
Top-k Elements Space Used
[Figure: Space Used for Top-100 on Synthetic Data. Series: Space-Saving, CountSketch, Probabilistic InPlace. x-axis: Zipf Alpha (0 to 3); y-axis: Space Used (Bytes).]
Conclusion

Contributions:
– An integrated approach to solve an interesting
family of problems
– Strict error bounds using little space
– Guarantees on results
– Special attention was given to Zipfian data
– Experimental validation

Future Work:
– Incremental frequent and top-k elements reporting