Streaming Algorithms CS6234 Advanced Algorithms February 10 2015 1

The stream model
• Data sequentially enters at a rapid rate from one or more inputs
• We cannot store the entire stream
• Processing in real-time
• Limited memory (usually sublinear in the size of the stream)
• Goal: compute a function of the stream, e.g., median, number of distinct elements, longest increasing sequence
• An approximate answer is usually good enough
Overview
Counting bits with DGIM algorithm
Bloom Filter
Count-Min Sketch
Approximate Heavy Hitters
AMS Sketch
AMS Sketch Applications
Counting bits with DGIM algorithm
Presented by Dmitrii Kharkovskii
Sliding windows
• A useful model: queries are about a window of length N
  • The N most recent elements received (or the last N time units)
• Interesting case: N is still so large that the window cannot be stored
  • Or there are so many streams that the windows for all of them cannot be stored
Problem description
• Problem
  • Given a stream of 0s and 1s
  • Answer queries of the form “how many 1s are in the last k bits?” where k ≤ N
• Obvious solution
  • Store the most recent N bits (i.e., window size = N)
  • When a new bit arrives, discard the (N+1)st most recent bit
• Real problem
  • Slow: need to scan k bits to count
  • What if we cannot afford to store N bits?
• Estimate with an approximate answer
Datar-Gionis-Indyk-Motwani Algorithm (DGIM)
Overview
• Approximate answer
• Uses $O(\log^2 N)$ bits of memory
• Performance guarantee: error no more than 50%
• Possible to decrease the error to any fraction $\varepsilon > 0$, still with $O(\log^2 N)$ bits of memory for fixed $\varepsilon$
• Possible to generalize to a stream of positive integers
Main idea of the algorithm
Represent the window as a set of exponentially growing non-overlapping buckets
Timestamps
• Each bit in the stream has a timestamp: its position in the stream, counted from the beginning.
• Record timestamps modulo N (the window size), so each timestamp uses $O(\log N)$ bits.
• Store the most recent timestamp to identify the position of any other bit in the window.
Buckets
• Each bucket has two components:
  • Timestamp of its most recent end. Needs $O(\log N)$ bits
  • Size of the bucket: the number of ones in it
    • Size is always a power of two, $2^j$
    • To store $j$ we need $O(\log \log N)$ bits
• Each bucket needs $O(\log N)$ bits
Representing the stream by buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• Buckets do not overlap.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).
Updating buckets when a new bit arrives
• Drop the oldest bucket if it no longer overlaps the window
• If the current bit is zero, no other changes are needed
• If the current bit is one
  • Create a new bucket for it: size = 1, timestamp = current time modulo N
  • If there are now 3 buckets of size 1, merge the two oldest into one of size 2
  • If there are now 3 buckets of size 2, merge the two oldest into one of size 4
  • ...
Example of updating process
Query Answering
How many ones are in the most recent k bits?
• Find all buckets overlapping the last k bits
• Sum the sizes of all but the oldest one
• Add half the size of the oldest one
• Example: Ans = 1 + 1 + 2 + 4 + 4 + 8 + 8/2 = 24
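The update and query rules of DGIM can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the class name `DGIM` and its methods are hypothetical, buckets are kept as `[right_end_timestamp, size]` pairs with the newest first, and timestamps are stored as absolute positions rather than modulo N to keep the code short.

```python
class DGIM:
    """Approximate count of 1s among the last N bits of a 0/1 stream."""

    def __init__(self, window_size):
        self.n = window_size
        self.time = 0        # absolute position (the slides store it modulo N)
        self.buckets = []    # [right_end_timestamp, size], newest first

    def update(self, bit):
        self.time += 1
        # Drop the oldest bucket once its right end has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.n:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.time, 1])
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            self.buckets[i + 1][1] *= 2   # merged bucket keeps the newer right end
            del self.buckets[i + 2]
            i += 1                        # the merge may cascade to the next size

    def query(self, k):
        """Estimate the number of 1s among the last k bits (k <= N)."""
        cutoff = self.time - k
        sizes = [size for ts, size in self.buckets if ts > cutoff]
        if not sizes:
            return 0
        # Count every overlapping bucket fully, except half of the oldest one.
        return sum(sizes[:-1]) + sizes[-1] / 2


dgim = DGIM(window_size=100)
for _ in range(7):
    dgim.update(1)
print(dgim.buckets)   # three buckets of sizes 1, 2, 4
print(dgim.query(7))  # estimate for the true count 7
```

After seven consecutive ones the buckets have sizes 1, 2 and 4, so the estimate is 1 + 2 + 4/2 = 5 against a true count of 7, which is within the 50% bound.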
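Memory requirements
• Each bucket needs $O(\log N)$ bits
• There are at most two buckets of each size, and $O(\log N)$ distinct sizes
• Total: $O(\log^2 N)$ bits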
Performance guarantee
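• Suppose the oldest bucket overlapping the query has size $2^r$.
• By counting half of it, the maximum error is $2^{r-1}$.
• There is at least one bucket of every size less than $2^r$.
• These smaller buckets contribute at least $1 + 2 + 4 + \dots + 2^{r-1} = 2^r - 1$ to the true sum.
• The most recent bit of the oldest bucket is always 1, so that bucket contributes at least 1 more, and the true sum is at least $2^r$.
• Error is at most $2^{r-1} / 2^r = 50\%$.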
References
J. Leskovec, A. Rajaraman, J. D. Ullman. “Mining of Massive Datasets”. Cambridge University Press.
Bloom Filter
Presented by Naheed Anjum Arafat
Motivation: The “Set Membership” Problem
• x: an element
• S: a finite set of elements
• Input: x, S
• Output:
  • True (if x in S)
  • False (if x not in S)
• Classical solution: binary search on a sorted array of size |S|. Runtime complexity: O(log |S|)
Streaming algorithm:
• Limited space per item
• Limited processing time per item
• Approximate answer based on a summary/sketch of the data stream kept in memory
Bloom Filter
• Consists of
  • a vector of n Boolean values, all initially set to false (initialization cost: O(n))
  • k independent and uniform hash functions $h_0, h_1, \dots, h_{k-1}$, each outputting a value in the range {0, 1, …, n−1}
[Figure: a Boolean vector of n = 10 cells, indexed 0 to 9, all set to F]
Bloom Filter
• For each element $s \in S$, the Boolean values at positions $h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to true.
• Complexity of insertion: O(k)
[Figure: with k = 3, inserting $s_1$ with $h_0(s_1) = 1$, $h_1(s_1) = 4$, $h_2(s_1) = 6$ flips cells 1, 4 and 6 from F to T]
Bloom Filter
• For each element $s \in S$, the Boolean values at positions $h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to true.
• Note: a particular Boolean value may be set to true several times.
[Figure: with k = 3, inserting $s_2$ with $h_0(s_2) = 4$, $h_1(s_2) = 7$, $h_2(s_2) = 9$ sets cells 7 and 9 to T; cell 4 was already set to T by $s_1$]
Algorithm to Approximate Set Membership Query
Input: x (may or may not be an element of S)
Output: Boolean
(B denotes the Boolean vector)
For all i ∈ {0, 1, …, k−1}
    if B[h_i(x)] is False
        return False
return True
Runtime complexity: O(k)
[Figure: the vector after inserting $s_1$ and $s_2$, with cells 1, 4, 6, 7 and 9 set to T, queried with x = $s_1$ and x = $s_3$; k = 3]
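A minimal Python sketch of insertion and the membership query follows. The class and method names are hypothetical, and salting SHA-256 with the index i is only an assumed stand-in for the k independent, uniform hash functions $h_0, \dots, h_{k-1}$ of the slides.

```python
import hashlib


class BloomFilter:
    def __init__(self, n, k):
        self.n = n                  # number of Boolean cells
        self.k = k                  # number of hash functions
        self.bits = [False] * n     # O(n) initialization

    def _positions(self, item):
        # Assumption: SHA-256 salted with i plays the role of h_i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        # O(k): set the k cells selected by the hash functions.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # O(k): any False cell proves the item was never inserted.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(n=1000, k=3)
bf.add("s1")
bf.add("s2")
print(bf.might_contain("s1"))  # True: inserted items are never missed
print(bf.might_contain("s3"))  # usually False, but may be a false positive
```

A `True` answer only means "possibly present"; a `False` answer is always correct.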
Algorithm to Approximate Set Membership Query
False positive!
[Figure: the vector after inserting $s_1$ (cells 1, 4, 6) and $s_2$ (cells 4, 7, 9), with k = 3. An element x that was never inserted hashes to $h_0(x) = 9$, $h_1(x) = 6$, $h_2(x) = 1$; all three cells are already T, so the query returns True]
Error Types
• False negative: answering “is not there” for an element which “is there”
  • Never happens for a Bloom filter
• False positive: answering “is there” for an element which “is not there”
  • May happen. How likely?
Probability of false positives
n = size of table
m = number of items
k = number of hash functions
Consider a particular bit j, $0 \le j \le n-1$.
Probability that $h_i(x)$ does not set bit j after hashing only 1 item:
$P(h_i(x) \ne j) = 1 - \frac{1}{n}$
Probability that $h_i(x)$ does not set bit j after hashing m items:
$P(\forall x \in \{S_1, S_2, \dots, S_m\}: h_i(x) \ne j) = \left(1 - \frac{1}{n}\right)^m$
Probability that none of the k hash functions sets bit j after hashing m items:
$P(\forall x \in \{S_1, S_2, \dots, S_m\}, \forall i \in \{0, 1, \dots, k-1\}: h_i(x) \ne j) = \left(1 - \frac{1}{n}\right)^{km}$
We know that $\left(1 - \frac{1}{n}\right)^n \approx \frac{1}{e} = e^{-1}$
$\Rightarrow \left(1 - \frac{1}{n}\right)^{km} = \left(\left(1 - \frac{1}{n}\right)^n\right)^{km/n} \approx \left(e^{-1}\right)^{km/n} = e^{-km/n}$
Probability that bit j is not set: $P(\text{bit } j = F) = e^{-km/n}$
Approximate probability of a false positive, i.e., the probability that all k bits of a new element are already set:
$\left(1 - e^{-km/n}\right)^k$
For fixed m and n, which value of k minimizes this bound? $k_{opt} = \ln 2 \cdot \frac{n}{m}$
The probability of a false positive at $k_{opt}$ is $\left(\frac{1}{2}\right)^{k_{opt}} = (0.6185)^{n/m}$, where $n/m$ is the number of bits per item.
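The two formulas above are easy to evaluate numerically. The helper names below are hypothetical, and the numbers assume a table with n/m = 10 bits per item.

```python
import math


def false_positive_rate(n, m, k):
    """Approximate false positive probability (1 - e^(-km/n))^k."""
    return (1.0 - math.exp(-k * m / n)) ** k


def optimal_k(n, m):
    """Real-valued k that minimizes the bound: ln(2) * n / m."""
    return math.log(2) * n / m


n, m = 1000, 100                      # 10 bits per item
k_opt = optimal_k(n, m)               # about 6.93
print(k_opt)
print(false_positive_rate(n, m, k_opt))      # equals 0.5 ** k_opt
print(false_positive_rate(n, m, 7))          # nearest integer k in practice
```

Since k must be an integer in practice, one would round $k_{opt}$ and re-evaluate the rate, as the last line does.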
Bloom Filters: cons
• Small but nonzero false positive probability
• Cannot handle deletions
• The size of the bit vector has to be set a priori in order to maintain a predetermined false positive rate. Resolved in “Scalable Bloom Filters”: Almeida, Paulo; Baquero, Carlos; Preguiça, Nuno; Hutchison, David (2007), "Scalable Bloom Filters", Information Processing Letters 101 (6): 255–261
References
• https://en.wikipedia.org/wiki/Bloom_filter
• Graham Cormode, Sketch Techniques for Approximate Query Processing, AT&T Research
• Michael Mitzenmacher, Compressed Bloom Filters, Harvard University, Cambridge
Count-Min Sketch
Erick Purwanto
A0050717L
Motivation Count-Min Sketch
• Implemented in real systems
– AT&T: used in network switches to analyze network traffic with limited memory
– Google: implemented on top of the MapReduce parallel processing infrastructure
• Simple, and used to solve other problems
– Heavy Hitters (presented by Joseph)
– Second Moment $F_2$, AMS Sketch (presented by Manupa)
– Inner Product, Self Join (presented by Sapumal)
Frequency Query
• Given a stream of data, a vector $x$ of length $n$ with $x_i \in [1, m]$, and an update (increment) operation,
– we want to know, at each time, the frequency $f_j$ of item $j$
– assume frequency $f_j \ge 0$
• Trivial if we keep a count array indexed by $[1, m]$
– we want sublinear space
– and a probably approximately correct answer
Count-Min Sketch
• Assumption:
– a family $H$ of hash functions $h : [1, m] \to [1, w]$
– sample $d$ independent hash functions $h_i \leftarrow H$
• Use: the $d$ independent hash functions and an integer array CM with $d$ rows and $w$ columns
Count-Min Sketch
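• Algorithm to update:
– Inc($j$): for each row $i$, CM[$i$, $h_i(j)$] += 1
[Figure: item $j$ from the stream $x$ is hashed by $h_1, h_2, \dots, h_d$ to one cell in each of the $d$ rows of the array CM of width $w$, and each of those cells is incremented by 1]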
Count-Min Sketch
• Algorithm to estimate a frequency query:
– Count($j$): $\hat{f}_j = \min_i$ CM[$i$, $h_i(j)$]
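Inc and Count can be sketched as follows. The class name is hypothetical, items are assumed to be non-negative integers smaller than the prime `p`, and the hash family $((a x + b) \bmod p) \bmod w$ is an assumed choice of $H$ (the slides do not fix one here). Columns are numbered from 0 rather than 1.

```python
import random


class CountMinSketch:
    def __init__(self, w, d, seed=0):
        self.w, self.d = w, d
        self.p = 2_147_483_647          # prime 2^31 - 1, assumed > every item id
        rng = random.Random(seed)
        # One (a, b) pair per row defines h_i(x) = ((a*x + b) mod p) mod w.
        self.ab = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(d)]
        self.cm = [[0] * w for _ in range(d)]

    def _h(self, i, j):
        a, b = self.ab[i]
        return ((a * j + b) % self.p) % self.w

    def inc(self, j):
        # Inc(j): add 1 to one cell in every row, O(d).
        for i in range(self.d):
            self.cm[i][self._h(i, j)] += 1

    def count(self, j):
        # Count(j): the minimum over rows is the least-contaminated estimate, O(d).
        return min(self.cm[i][self._h(i, j)] for i in range(self.d))


cms = CountMinSketch(w=64, d=4)
for item in [3, 5, 5, 8, 5, 2, 5]:
    cms.inc(item)
print(cms.count(5))  # at least 4; collisions can only push the estimate up
```

Because counters are only ever incremented, the estimate never falls below the true frequency.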
Collision
• Entry CM[$i$, $h_i(j)$] is an estimate of the frequency of item $j$ at row $i$
– for example, $h_1(5) = h_1(2) = 7$
[Figure: stream $x$ = … 3 5 5 8 5 2 5; in row 1, items 5 and 2 both hash to cell 7]
• Let $f_j$ be the frequency of $j$, and let the random variable $X_{i,j}$ be the total frequency of all $k \ne j$ with $h_i(k) = h_i(j)$
Count-Min Sketch Analysis
• Estimated frequency of $j$ at row $i$:
$\hat{f}_{i,j} = \mathrm{CM}[i, h_i(j)] = f_j + \sum_{k \ne j,\ h_i(k) = h_i(j)} f_k = f_j + X_{i,j}$
Count-Min Sketch Analysis
• Let $\varepsilon$ be the approximation error, and set $w = e / \varepsilon$
• The expectation of the other items' contribution, with $F_1 = \sum_k f_k$:
$E[X_{i,j}] = \sum_{k \ne j} f_k \cdot \Pr[h_i(k) = h_i(j)] \le \Pr[h_i(k) = h_i(j)] \cdot \sum_k f_k = \frac{1}{w} \cdot F_1 = \frac{\varepsilon}{e} \cdot F_1$
Count-Min Sketch Analysis
• Markov inequality: $\Pr[X \ge k \cdot E[X]] \le \frac{1}{k}$
• Probability that a row estimate is more than $\varepsilon \cdot F_1$ away from the true value:
$\Pr[\hat{f}_{i,j} > f_j + \varepsilon \cdot F_1] = \Pr[X_{i,j} > \varepsilon \cdot F_1] \le \Pr[X_{i,j} > e \cdot E[X_{i,j}]] \le \frac{1}{e}$
Count-Min Sketch Analysis
• Let $\delta$ be the failure probability, and set $d = \ln(1/\delta)$
• Probability that the final estimate is far from the true value:
$\Pr[\hat{f}_j > f_j + \varepsilon \cdot F_1] = \Pr[\forall i: \hat{f}_{i,j} > f_j + \varepsilon \cdot F_1] = \left(\Pr[\hat{f}_{i,j} > f_j + \varepsilon \cdot F_1]\right)^d \le \left(\frac{1}{e}\right)^{\ln(1/\delta)} = \delta$
Count-Min Sketch
• Result
– a dynamic data structure CM for item frequency queries
– set $w = e / \varepsilon$ and $d = \ln(1/\delta)$
– with probability at least $1 - \delta$, $\hat{f}_j \le f_j + \varepsilon \cdot \sum_k f_k$
– sublinear space: the number of counters depends on neither $n$ nor $m$
– running time: update $O(d)$ and frequency query $O(d)$
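As a quick worked example of these settings, the snippet below turns a target error $\varepsilon$ and failure probability $\delta$ into table dimensions. The function name is hypothetical, and rounding up to whole rows and columns is an assumption made so that the guarantee is not weakened.

```python
import math


def cms_dimensions(eps, delta):
    """Width w = e/eps and depth d = ln(1/delta), rounded up to integers."""
    w = math.ceil(math.e / eps)
    d = math.ceil(math.log(1.0 / delta))
    return w, d


w, d = cms_dimensions(eps=0.01, delta=0.01)
print(w, d)      # 272 columns, 5 rows
print(w * d)     # 1360 counters, independent of n and m
```

With 1% error and 1% failure probability the sketch needs only 1360 counters, however long the stream or large the item domain.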
Approximate Heavy Hitters
TaeHoon Joseph, Kim
Count-Min Sketch (CMS)
• Inc($j$) takes $O(d)$ time
– $O(1 \times d)$: update $d$ values
• Count($j$) takes $O(d)$ time
– $O(1 \times d)$: return the minimum of $d$ values
Heavy Hitters Problem
• Input:
– An array of length $n$ with $m$ distinct items
• Objective:
– Find all items that occur more than $n/k$ times in the array
• there can be at most $k$ such items
• Parameter:
– $k$
Heavy Hitters Problem: Naïve Solution
• The trivial solution uses an array of size $O(m)$
1. Store all items and each item's frequency
2. Find the (at most $k$) items with frequency $\ge n/k$
$\epsilon$-Heavy Hitters Problem ($\epsilon$-HH)
• A relaxation of the Heavy Hitters Problem
• Requires only sublinear space
– the exact problem cannot be solved in sublinear space
– parameters: $k$ and $\epsilon$
$\epsilon$-Heavy Hitters Problem ($\epsilon$-HH)
1. Returns every item that occurs more than $n/k$ times
2. Any other item it returns occurs more than $n/k - \epsilon \cdot n$ times
– Count-Min Sketch guarantee: $\hat{f}_j \le f_j + \varepsilon \cdot \sum_k f_k$
Naïve Solution using CMS
[Figure: each of the $m$ items 1, …, m−2, m−1, m is hashed by $h_1, h_2, \dots, h_d$ into the Count-Min Sketch array with $d$ rows and $w$ columns]
• Query the frequency of all $m$ items
– Return the items with Count($j$) $\ge n/k$
• Takes $O(md)$ time
– slow
Better Solution
• Use a CMS to store the frequencies
• Use a baseline $b = i/k$ as the threshold at the $i$-th item
• Use a MinHeap to store the potential heavy hitters at the $i$-th item
– store new items in the MinHeap if their frequency is $\ge b$
– delete old items from the MinHeap once their frequency is $< b$
$\epsilon$-Heavy Hitters Problem ($\epsilon$-HH)
1. Returns every item that occurs more than $n/k$ times
2. Any other item it returns occurs more than $n/k - \epsilon \cdot n$ times
– with $\epsilon = \frac{1}{2k}$, Count($x$) $\in [f_x, f_x + \frac{n}{2k}]$
– heap size $= 2k$
Algorithm Approximate Heavy Hitters
Input: stream $x$, parameter $k$
For each item $j \in x$:
1. Update the Count-Min Sketch with $j$
2. Compare the estimated frequency Count($j$) with $b$
3. If Count($j$) $\ge b$, insert $j$ into the MinHeap or update its entry
4. Remove any entry in the MinHeap with frequency $< b$
Return the contents of the MinHeap as the heavy hitters
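The loop above can be sketched in Python with the standard `heapq` module. Everything here is illustrative: the function name, the default sketch size, and the hash family $((a x + b) \bmod p) \bmod w$ are assumptions, items are assumed to be non-negative integers below `p`, and "updating" a heap entry is done lazily by pushing a fresh entry and discarding stale ones when they surface.

```python
import heapq
import random


def approximate_heavy_hitters(stream, k, w=256, d=4, seed=0):
    p = 2_147_483_647                     # prime, assumed > every item id
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(d)]
    cm = [[0] * w for _ in range(d)]      # Count-Min Sketch counters
    heap = []                             # (estimated count, item), may be stale
    current = {}                          # item -> its live estimate in the heap

    for i, j in enumerate(stream, start=1):
        # 1. Update the sketch and read back the estimate for j.
        cells = [((a * j + c) % p) % w for a, c in coeffs]
        for row, col in enumerate(cells):
            cm[row][col] += 1
        est = min(cm[row][col] for row, col in enumerate(cells))
        # 2-3. Compare with the baseline b = i / k; insert or update j.
        baseline = i / k
        if est >= baseline:
            current[j] = est
            heapq.heappush(heap, (est, j))
        # 4. Pop entries that are stale or have fallen below the baseline.
        while heap and (current.get(heap[0][1]) != heap[0][0] or heap[0][0] < baseline):
            old_est, item = heapq.heappop(heap)
            if current.get(item) == old_est:
                del current[item]
    return current


stream = [1] * 50 + [2] * 30 + list(range(100, 120))   # n = 100
hitters = approximate_heavy_hitters(stream, k=5)
print(sorted(hitters))   # contains 1 and 2, which occur more than n/k = 20 times
```

Since the sketch never underestimates, every item with true frequency above $n/k$ is guaranteed to be in the result; a few lighter items may appear as well.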
EXAMPLES ($k = 5$; heap entries are written {count:item})
• $i = 1$: item 4 arrives, $b = i/k = 1/5$. Its CMS counters are all 1, so {1:4} is inserted into the Min-Heap.
• $i = 1, \dots, 5$: items 4, 2, 6, 9, 3 have arrived, $b = 5/5$. The Min-Heap holds {1:4}, {1:2}, {1:6}, {1:9}, {1:3}.
• $i = 6$: item 4 arrives again, $b = 6/5$. Its CMS counters become 2, so its entry is updated to {2:4}; the entries with count 1 are below $b$ and are removed, leaving {2:4}.
• $i = 79$: item 2 arrives, $b = 79/5 = 15.8$. The Min-Heap holds {16:4}, {20:9}, {23:6}. The counters of item 2 go from 16, 18, 15 to 17, 19, 16, so Count(2) = 16 $\ge b$ and {16:2} is inserted.
• $i = 80$: item 1 arrives, $b = 80/5 = 16$. Its counters are 3, 6, 4, so it is not inserted.
• $i = 81$: item 9 arrives, $b = 81/5 = 16.2$. Its counters go from 20, 24, 25 to 21, 25, 26, so its entry is updated to {21:9}; {16:2} and {16:4} are now below $b$ and are removed, leaving {21:9}, {23:6}.
Analysis
• Because $n$ is unknown in advance, the candidate heavy hitters are recomputed and stored every time a new item arrives
• Maintaining the heap requires an extra $O(\log k) = O(\log(1/\epsilon))$ time per item
AMS Sketch: Estimate Second Moment
Dissanayaka Mudiyanselage Emil Manupa Karunaratne
The Second Moment
• Stream :
• The Second Moment: $F_2 = \sum_i f_i^2$, the sum of the squared item frequencies
• The trivial solution would be to maintain a histogram of size n and compute the sum of squares
• It is not feasible to maintain such a large array, so we look for an approximation algorithm that achieves sublinear space complexity with bounded error
• The algorithm gives an estimate within ε relative error with δ failure probability (two parameters)
The Method
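[Figure: the next item j is hashed to one bucket in each of the d rows, and $g_1(j), g_2(j), \dots, g_d(j)$ is added to the chosen bucket of the corresponding row]
• j is the next item in the stream.
• d 2-wise independent hash functions $h_k$ select the bucket in each row
• After finding the bucket, d 4-wise independent hash functions $g_k$ decide whether to increment or decrement: $g_k(j) \in \{-1, +1\}$
• In summary: for each row $k$, CM[$k$, $h_k(j)$] += $g_k(j)$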
The Method
• Calculate the row estimate: $R_k = \sum_{i=1}^{w} \mathrm{CM}[k, i]^2$
• Median: the final estimate is the median of $R_1, R_2, \dots, R_d$
• Choose $w = \frac{4}{\epsilon^2}$ and $d = 8 \log\left(\frac{1}{\delta}\right)$; doing so gives an estimate with $\epsilon$ relative error and $\delta$ failure probability
Why should this method give F2 ?
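• For the kth row, cell $c$ holds $\mathrm{CM}[k, c] = \sum_{j:\ h_k(j) = c} g_k(j) f_j$
• Estimate of $F_2$ from the kth row: $R_k = \sum_{c=1}^{w} \mathrm{CM}[k, c]^2$
• Expanding the squares, each row gives two kinds of terms: $R_k = \sum_j g_k(j)^2 f_j^2 + \sum_{i \ne j,\ h_k(i) = h_k(j)} g_k(i) g_k(j) f_i f_j$
• First part: $g_k(j)^2 = 1$, so this sum is exactly $F_2$
• Second part: $g_k(i) g_k(j)$ is +1 or −1 with equal probability, so its expectation is 0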
What guarantee can we give about the accuracy ?
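• The variance of $R_k$, a row estimate, is caused by hashing collisions.
• Given the independence properties of the hash functions, we can state that the variance is bounded by $\frac{F_2^2}{w}$.
• Using the Chebyshev inequality, $\Pr[|R_k - F_2| \ge \epsilon F_2] \le \frac{\mathrm{Var}[R_k]}{\epsilon^2 F_2^2} \le \frac{1}{w \epsilon^2}$
• Let us assign $w = \frac{4}{\epsilon^2}$, so that a single row fails with probability at most $\frac{1}{4}$
• Still, the failure probability of a single row is only linear in $\frac{1}{w}$.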
What guarantee can we give about the accuracy ?
• We had d hash functions, which produce the estimates $R_1, R_2, \dots, R_d$.
• The median being wrong ⇒ at least half of the estimates are wrong
• These are d independent estimates, like coin tosses, for which the probability of that many failures decays exponentially in d.
• They obey stronger bounds, the Chernoff bounds:
• $\mu = d \cdot \frac{3}{4}$: the number of estimates times the success probability of one estimate
• For the median to fail, the number of correct estimates must drop to $\frac{d}{2}$, i.e., an error of $\frac{d}{4}$ away from the mean
Space and Time Complexity
• E.g., to achieve a failure probability of $\delta = e^{-10}$, only 8 × 10 = 80 rows are required
• Space complexity is $O(\log(1/\delta))$ for a fixed $\epsilon$.
• Time complexity is explained later, along with the applications
AMS Sketch and Applications
Sapumal Ahangama
Hash functions
• $h_k$ maps the input domain uniformly onto the buckets $\{1, 2, \dots, w\}$
• $h_k$ should be a pairwise independent hash function, to cancel out product terms
– Ex: the family $h(x) = ((ax + b) \bmod p) \bmod w$
– for $a$ and $b$ chosen from the prime field $p$, $a \ne 0$
Hash functions
• $g_k$ maps elements of the domain uniformly onto $\{-1, +1\}$
• $g_k$ should be four-wise independent
• Ex: the family of degree-3 polynomials $(ax^3 + bx^2 + cx + d) \bmod p$
• $g(x) = 2\left(\left((ax^3 + bx^2 + cx + d) \bmod p\right) \bmod 2\right) - 1$
– for $a, b, c, d$ chosen uniformly from the prime field $p$.
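Both example families translate directly into code. The sketch below uses hypothetical factory names, assumes integer items smaller than the prime `P` (here $2^{31} - 1$, an arbitrary choice), and numbers buckets from 0 to w−1 instead of 1 to w.

```python
import random

P = 2_147_483_647  # prime 2^31 - 1; assumed to exceed every item id


def make_bucket_hash(w, rng):
    """Pairwise independent h(x) = ((a*x + b) mod p) mod w, with a != 0."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % w


def make_sign_hash(rng):
    """Four-wise independent g(x) in {-1, +1} from a random cubic polynomial."""
    a, b, c, d = (rng.randrange(P) for _ in range(4))
    return lambda x: 2 * (((a * x**3 + b * x**2 + c * x + d) % P) % 2) - 1


rng = random.Random(42)
h = make_bucket_hash(8, rng)
g = make_sign_hash(rng)
print([h(x) for x in (23, 99)])   # bucket indices in 0..7
print([g(x) for x in (23, 99)])   # signs, each -1 or +1
```

Each row of a sketch would draw its own `h` and `g` from these factories.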
Hash functions
• These hash functions can be computed very quickly, faster even than
more familiar (cryptographic) hash functions
• For scenarios which require very high throughput, efficient
implementations are available for hash functions,
– Based on optimizations for particular values of p, and partial
precomputations
– Ref: M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with
applications to second moment estimation. In ACM-SIAM Symposium on
Discrete Algorithms, 2004
Time complexity - Update
• The sketch is initialized by picking the hash functions to use and setting the array of counters to all zeros
• For each update operation, the item is mapped to one entry in each row $k$ by the hash function $h_k$, and the update value multiplied by the corresponding $g_k$ is added to that entry
• Processing each update therefore takes $O(d)$ time
– since each hash function evaluation takes constant time.
Time complexity - Query
• The estimate is found by summing the squares of the entries in each row of the sketch and taking the median of these sums.
– That is, for each row $k$, compute $\sum_i \mathrm{CM}[k, i]^2$
– Take the median of the $d$ such estimates
• Hence the query time is linear in the size of the sketch, $O(wd)$
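Putting update and query together gives the following minimal sketch of an AMS structure. The class name, the prime `P`, and the zero-based bucket numbering are assumptions made for illustration; the hash families are the linear and cubic examples described in the hash function slides.

```python
import random
import statistics


class AMSSketch:
    P = 2_147_483_647  # prime 2^31 - 1; assumed to exceed every item id

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.h = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(d)]
        self.g = [tuple(rng.randrange(self.P) for _ in range(4)) for _ in range(d)]
        self.cm = [[0] * w for _ in range(d)]   # counters start at zero

    def _bucket(self, k, x):
        a, b = self.h[k]
        return ((a * x + b) % self.P) % self.w

    def _sign(self, k, x):
        a, b, c, e = self.g[k]
        return 2 * (((a * x**3 + b * x**2 + c * x + e) % self.P) % 2) - 1

    def update(self, x, value=1):
        # O(d): one counter per row receives g_k(x) * value.
        for k in range(self.d):
            self.cm[k][self._bucket(k, x)] += self._sign(k, x) * value

    def estimate_f2(self):
        # O(wd): median over rows of the sum of squared counters.
        return statistics.median(sum(c * c for c in row) for row in self.cm)


sketch = AMSSketch(w=8, d=3)
sketch.update(23, 1)
sketch.update(23, 2)
print(sketch.estimate_f2())  # 9: a single item with frequency 3 has F2 = 3^2
```

With only one distinct item there are no collisions, so every row holds a single counter of ±3 and the estimate is exact.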
Applications - Inner product
• The AMS sketch can be used to estimate the inner product of a pair of vectors
• Given two frequency distributions $f$ and $f'$:
$f \cdot f' = \sum_{i=1}^{M} f(i) \cdot f'(i)$
• The AMS-sketch-based estimator is an unbiased estimator of the inner product of the vectors
Inner Product
• Two sketches, CM and CM′
• Formed with the same parameters and the same hash functions (same $w$, $d$, $h_k$, $g_k$)
• The row estimate is the inner product of the rows: $\sum_{i=1}^{w} \mathrm{CM}[k, i] \cdot \mathrm{CM}'[k, i]$
Inner Product
• Expanding $\sum_{i=1}^{w} \mathrm{CM}[k, i] \cdot \mathrm{CM}'[k, i]$
• shows that the estimate gives $f \cdot f'$ plus additional cross-terms due to collisions of items under $h_k$
• The expectation of these cross-terms is zero
– over the choice of the hash functions, since the function $g_k$ is equally likely to add as to subtract any given term.
Inner Product – Join size estimation
• The inner product has a natural interpretation as the size of the equi-join between two relations
• In SQL:
SELECT COUNT(*) FROM D, D’ WHERE D.id = D’.id
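The join-size estimate can be sketched as below. The helper names are hypothetical, each relation is assumed to be summarized as a map from integer id to its number of rows, and the hash families are the linear and cubic examples from the hash function slides with an assumed prime `P`.

```python
import random
import statistics

P = 2_147_483_647  # prime 2^31 - 1; assumed to exceed every id


def make_hashes(d, seed=0):
    """One (bucket coefficients, sign coefficients) pair per row."""
    rng = random.Random(seed)
    return [([rng.randrange(1, P), rng.randrange(P)],
             [rng.randrange(P) for _ in range(4)]) for _ in range(d)]


def build_sketch(freqs, w, hashes):
    cm = [[0] * w for _ in hashes]
    for x, f in freqs.items():
        for row, ((a, b), (c3, c2, c1, c0)) in zip(cm, hashes):
            sign = 2 * (((c3 * x**3 + c2 * x**2 + c1 * x + c0) % P) % 2) - 1
            row[((a * x + b) % P) % w] += sign * f
    return cm


def estimate_inner_product(cm_a, cm_b):
    # Median over rows of the row-wise inner products.
    return statistics.median(
        sum(u * v for u, v in zip(row_a, row_b)) for row_a, row_b in zip(cm_a, cm_b)
    )


hashes = make_hashes(d=5)                 # both sketches must share the hashes
sketch_d = build_sketch({7: 3}, w=16, hashes=hashes)    # id 7 appears 3 times in D
sketch_d2 = build_sketch({7: 4}, w=16, hashes=hashes)   # id 7 appears 4 times in D'
print(estimate_inner_product(sketch_d, sketch_d2))      # 12 = 3 * 4 joined rows
```

The two sketches can be built independently, even on different machines, as long as they share `w`, `d` and the hash functions.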
Example
Sketch with d = 3 rows and w = 8 columns; all counters are initially 0.

UPDATE(23, 1): $h_1(23) = 3$, $g_1(23) = -1$; $h_2(23) = 1$, $g_2(23) = -1$; $h_3(23) = 7$, $g_3(23) = +1$

Row |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8
1   |  0 |  0 | -1 |  0 |  0 |  0 |  0 |  0
2   | -1 |  0 |  0 |  0 |  0 |  0 |  0 |  0
3   |  0 |  0 |  0 |  0 |  0 |  0 | +1 |  0

UPDATE(99, 2): $h_1(99) = 5$, $g_1(99) = +1$; $h_2(99) = 1$, $g_2(99) = -1$; $h_3(99) = 3$, $g_3(99) = +1$

Row |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8
1   |  0 |  0 | -1 |  0 | +2 |  0 |  0 |  0
2   | -3 |  0 |  0 |  0 |  0 |  0 |  0 |  0
3   |  0 |  0 | +2 |  0 |  0 |  0 | +1 |  0