# Lecture 3 (streaming)

```
Sketching, Sampling and other Sublinear Algorithms: Streaming
Alex Andoni (MSR SVC)

A scenario

Stream of IP addresses, as they arrive:
131.107.65.14, 18.9.22.69, 131.107.65.14, 80.97.56.20, 18.9.22.69, 80.97.56.20, 131.107.65.14

Challenge: compute something on the table, using small space.

Example of “something”:
• # distinct IPs
• max frequency
• other statistics…

IP               Frequency
131.107.65.14    3
18.9.22.69       2
80.97.56.20      2
128.112.128.81   9
127.0.0.1        8
257.2.5.7        0
7.8.20.13        1

Sublinear: a panacea?

• Sub-linear space algorithm for solving Travelling Salesperson Problem?
  • Sorry, perhaps a different lecture
• Hard to solve sublinearly even very simple problems:
  • Ex: what is the count of distinct IPs seen
• Will settle for:
  • Approximate algorithms: 1+ε approximation
  • Randomized: above holds with probability 95%
• Quick and dirty way to get a sense of the data

IP               Frequency
131.107.65.14    3
18.9.22.69       2
80.97.56.20      2
128.112.128.81   9
127.0.0.1        8
257.2.5.7        0
8.3.20.12        1

Streaming data

• Data through a router
• Data stored on a hard drive, or streamed remotely
  • More efficient to do a linear scan on a hard drive
  • Working memory is the (smaller) main memory

Application areas

Data can come from:
• Network logs, sensor data
• Real time data
• Databases (query planning)
• …

Problem 1: # distinct elements

• Problem: compute the number of distinct elements in the stream
• Trivial solution: O(m) space for m distinct elements
• Will see: O(log m) space (approximate)

Stream: 2 5 7 5 5

i   Frequency
2   1
5   3
7   1

Distinct Elements: idea 1
[Flajolet-Martin’85, Alon-Matias-Szegedy’96]

Algorithm:
• Hash function h: U → [0,1], where U is the universe of elements
• Compute minHash = min_{i ∈ S} h(i), where S is the set of elements in the stream
• Output is 1/minHash - 1

“Analysis”:
• repeats of the same element i don’t matter
• E[minHash] = 1/(m+1), for m distinct elements

[Figure: the interval [0,1] with the hash values h(5), h(7), h(2) marked; the minimum sits at about 1/(m+1)]

Algorithm DISTINCT:
  Initialize:
    minHash = 1
    hash function h into [0,1]
  Process(int i):
    if (h(i) < minHash)
      minHash = h(i);
  Output: 1/minHash - 1

Distinct Elements: idea 2

Store minHash approximately:
• Store just the count of leading zeros in the binary expansion of the hash value
  • ZEROS(x): e.g., x = 0.0000001100101 has ZEROS(x) = 6
• Need only O(log log m) bits, instead of the O(log m) bits needed for minHash

Randomness: 2-wise independence is enough!

Better accuracy using more space:
• error 1+ε
• repeat O(1/ε^2) times with different hash functions
• HyperLogLog: can also do this with just one hash function [FFGM’07]

Algorithm DISTINCT:
  Initialize:
    minHash2 = 0
    hash function h into [0,1]
  Process(int i):
    if (h(i) < 1/2^minHash2)
      minHash2 = ZEROS(h(i));
  Output: 2^minHash2

Problem 2: max count / heavy hitters

• Problem: compute the maximum frequency of an element in the stream
• Bad news:
  • Hard to distinguish whether an element repeated (max = 1 vs 2)
• Good news:
  • Can find “heavy hitters”
    • elements with frequency > total frequency / s
    • using space proportional to s

Stream: 2 5 7 5 5

IP  Frequency
2   1
5   3
7   1

Heavy Hitters: CountMin
[Charikar-Chen-FarachColton’04, Cormode-Muthukrishnan’05]

[Figure: CountMin sketch of the stream, one row of counters per hash function h1, h2, h3; each element increments one counter per row. Resulting estimates: freq[2] = 1, freq[5] = 3, freq[7] = 1, freq[11] = 1]

Algorithm CountMin:
  Initialize(w, L):
    array Sketch[L][w]                     // L rows of w counters, all zero
    L hash functions h[L], into {0,…,w-1}

  Process(int i):
    for (j = 0; j < L; j++)
      Sketch[j][ h[j](i) ] += 1;           // one counter per row

  Output:
    foreach i in PossibleIP {
      freq[i] = int.MaxValue;
      for (j = 0; j < L; j++)
        freq[i] = min(freq[i], Sketch[j][ h[j](i) ]);
    }
    // freq[] is the frequency estimate

Heavy Hitters: analysis

• Sketch[j][ h[j](5) ] = frequency of 5, plus “extra mass”
• Expected “extra mass” ≤ total mass / w
• Markov's inequality: “extra mass” ≤ 2 · total mass / w with probability ≥ 1/2
• L = O(log m) to get high probability (for all m elements)
• Compute heavy hitters from freq[]

Problem 3: Moments

• Problem: compute frequency moment
  • variance F_2 = Σ_i f(i)^2, or
  • higher moments F_k = Σ_i f(i)^k for k > 2
    • Skewness (k=3), kurtosis (k=4), etc.
    • a different proxy for max: lim_{k→∞} F_k^{1/k} = max_i f(i)

IP  Frequency f(i)  f(i)^2  f(i)^4
2   1               1       1
5   3               9       81
7   2               4       16

F_2 = 1 + 9 + 4 = 14,     F_2^{1/2} = 3.74
F_4 = 1 + 81 + 16 = 98,   F_4^{1/4} = 3.15

F_2 moment

• Use Johnson-Lindenstrauss lemma! (2nd lecture)
• Store sketch G·f
  • f = frequency vector
  • G = d by n matrix of Gaussian entries
• Update on element i:
  • G(f + e_i) = G·f + G·e_i
• Guarantees:
  • d = O(1/ε^2) counters (words)
  • O(d) time to update
• Better: ±1 entries, O(1) update [AMS’96, TZ’04]
• F_k: precision sampling => next

Scenario 2: distributed traffic

• Statistics on traffic difference/aggregate between two routers
  • E.g.: traffic differs by how many packets?
• Linearity is the power!
  • Sketch(data 1) + Sketch(data 2) = Sketch(data 1 + data 2)
  • Sketch(data 1) - Sketch(data 2) = Sketch(data 1 - data 2)
• Two sketches should be sufficient to compute something on the difference or sum

[Figure: two routers, each seeing its own stream and holding its own frequency table]

First router:
IP               Frequency
131.107.65.14    1
18.9.22.69       1
35.8.10.140      1

Second router:
IP               Frequency
131.107.65.14    1
18.9.22.69       2

Common primitive: estimate sum

• Given: n quantities a_1, a_2, …, a_n in the range [0,1]
• Goal: estimate S = a_1 + a_2 + ⋯ + a_n “cheaply”
• Standard sampling: pick random set J = {j_1, …, j_m} of size m
  • Estimator: S̃ = (n/m) · (a_{j_1} + a_{j_2} + ⋯ + a_{j_m})
• Chebyshev bound: with 90% success probability
    (1/2)·S - O(n/m) < S̃ < 2·S + O(n/m)
• For constant additive error, need m = Ω(n)

[Figure: quantities a_1, a_2, a_3, a_4; compute an estimate S̃ from the sampled a_1, a_3]

Precision Sampling Framework

• Alternative “access” to a_i’s:
  • For each term a_i, we get a (rough) estimate ã_i
  • up to some precision u_i, chosen in advance: |a_i - ã_i| < u_i
• Challenge: achieve a good trade-off between:
  • quality of approximation to S
  • use only weak precisions u_i (minimize “cost” of estimating ã_i)

[Figure: each a_i shown with its estimate ã_i and precision u_i, for i = 1, …, 4; compute an estimate S̃ from ã_1, ã_2, ã_3, ã_4]

Formalization

Sum Estimator:
  1. fix precisions u_i
  3. given ã_1, ã_2, …, ã_n, output S̃ s.t. |S - S̃| < 1.

Adversary:
  1. fix a_1, a_2, …, a_n
  2. fix ã_1, ã_2, …, ã_n s.t. |a_i - ã_i| < u_i

What is cost?
• Here, average cost = 1/n · Σ_i 1/u_i
  • to achieve precision u_i, use 1/u_i “resources”: e.g., if a_i is itself a sum a_i = Σ_j a_ij computed by subsampling, then one needs Θ(1/u_i) samples
• For example, can choose all u_i = 1/n
  • Average cost ≈ n

Precision Sampling Lemma
[A-Krauthgamer-Onak’11]

• Goal: estimate S = Σ a_i from {ã_i} satisfying |a_i - ã_i| < u_i.
• Precision Sampling Lemma: can get, with 90% success:
  • O(1) additive error and 1.5 multiplicative error:
      S - O(1) < S̃ < 1.5·S + O(1)
    with average cost equal to O(log n)
  • ε additive error and 1+ε multiplicative error:
      S - ε < S̃ < (1+ε)·S + ε
    with average cost equal to O(ε^-3 log n)
• Example: distinguish Σ a_i = 3 vs Σ a_i = 0
  • Consider two extreme cases:
    • if three a_i = 1: enough to have crude approx for all (u_i = 0.1)
    • if all a_i = 3/n: only few with good approx u_i = 1/n, and the rest with u_i = 1

Precision Sampling Algorithm

Algorithm (for the Precision Sampling Lemma):
• Choose each u_i ∈ [0,1] i.i.d. uniformly at random
  • for the 1+ε version: distribution = minimum of O(ε^-3) uniform random variables (u.r.v.)
• Estimator: S̃ = number of i’s s.t. ã_i / u_i > 6 (up to a normalization constant)
  • for the 1+ε version: a concrete function of [ã_i / u_i - 4/ε]^+ and the u_i’s

Proof of correctness:
• we use only ã_i which are 1.5-approximation to a_i
• E[S̃] ≈ Σ_i Pr[a_i / u_i > 6] = Σ_i a_i / 6
• E[1/u_i] = O(log n) w.h.p.

Moments (F_k) via precision sampling

• Theorem: linear sketch for F_k with O(1) approximation, and O(n^{1-2/k} log n) space (90% succ. prob.).
• Sketch:
  • Pick random u_i ∈ [0,1], s_i ∈ {±1}, and let y_i = s_i · x_i / u_i^{1/k}
  • throw the y_i’s into one hash table H, with m = O(n^{1-2/k} log n) cells
• Estimator:
  • max_c |H[c]|^k
• Randomness: O(1) independence suffices

[Figure: x = (x_1, x_2, x_3, x_4, x_5, x_6) hashed into H = (y_1 + y_4, y_3, y_2 + y_5 + y_6)]

Streaming++

• LOTS of work in the area:
  • Surveys
    • McGregor: http://people.cs.umass.edu/~mcgregor/papers/08graphmining.pdf
    • Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/CS49Fall11/Notes/lecnotes.pdf
    • Open problems: http://sublinear.info
  • Examples:
    • Moments, sampling
    • Median estimation, longest increasing sequence
    • Graph algorithms
      • E.g., dynamic graph connectivity [AGG’12, KKM’13,…]
    • Numerical algorithms (e.g., regression, SVD approximation)
      • Fastest (sparse) regression […CW’13,MM’13,KN’13,LMP’13]
      • related to Compressed Sensing
```
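
The DISTINCT algorithm from the "Distinct Elements: idea 1" slide can be sketched in Python roughly as follows. This is a minimal illustration, not the lecture's implementation: `hash01`, `Distinct` and `estimate_distinct` are hypothetical names, SHA-256 stands in for the idealized hash function h into [0,1], and averaging the minima over several seeds before inverting is one assumed way to realize "repeat with different hash functions".

```python
import hashlib


def hash01(item, seed):
    """Hash an item to a float in [0, 1]; a stand-in for the slide's h."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class Distinct:
    """One copy of Algorithm DISTINCT: keeps only the smallest hash value."""

    def __init__(self, seed=0):
        self.seed = seed
        self.min_hash = 1.0

    def process(self, item):
        # A repeated item hashes to the same value, so it never changes the minimum.
        self.min_hash = min(self.min_hash, hash01(item, self.seed))

    def output(self):
        return 1.0 / self.min_hash - 1.0


def estimate_distinct(stream, repetitions=256):
    """Run independent copies (one seed each) and invert the average minimum.

    Assumption: the slides only say to repeat with different hash functions;
    averaging the minima is one simple way to combine the copies.
    """
    copies = [Distinct(seed) for seed in range(repetitions)]
    for item in stream:
        for copy in copies:
            copy.process(item)
    mean_min = sum(copy.min_hash for copy in copies) / repetitions
    return 1.0 / mean_min - 1.0
```

Each copy stores a single number, so the memory used is proportional to the number of repetitions, not to the length of the stream or the number of distinct elements.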

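
Algorithm CountMin from the "Heavy Hitters" slides could look like this in Python. It is a sketch under stated assumptions: items are non-negative integers, the hash functions are of the form ((a·i + b) mod P) mod w with random a, b (a common choice, not specified in the slides), and `heavy_hitters` with its `candidates` argument is a hypothetical helper playing the role of the loop over PossibleIP.

```python
import random


class CountMin:
    P = 2**61 - 1  # a Mersenne prime, assumed larger than any item id

    def __init__(self, w, L, seed=0):
        rng = random.Random(seed)
        self.w, self.L = w, L
        self.total = 0  # length of the stream seen so far
        self.sketch = [[0] * w for _ in range(L)]
        # One (a, b) pair per row defines that row's hash function.
        self.coeffs = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(L)]

    def _h(self, j, i):
        a, b = self.coeffs[j]
        return ((a * i + b) % self.P) % self.w

    def process(self, i):
        self.total += 1
        for j in range(self.L):
            self.sketch[j][self._h(j, i)] += 1

    def estimate(self, i):
        # Every row over-counts (true frequency plus "extra mass"),
        # so the minimum over rows is the tightest estimate.
        return min(self.sketch[j][self._h(j, i)] for j in range(self.L))

    def heavy_hitters(self, candidates, s):
        """Candidates whose estimated frequency exceeds total / s."""
        return [i for i in candidates if self.estimate(i) > self.total / s]
```

For example, after processing the stream 2 5 7 5 5 with `w=16, L=4`, `estimate(5)` is at least 3 and `heavy_hitters([2, 5, 7, 11], s=2)` contains 5; estimates never fall below the true frequency.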

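
The ±1 variant of the F_2 sketch, together with the linearity property from "Scenario 2", can be illustrated as below. This is only a sketch under assumptions: the class name `F2Sketch` and its methods are hypothetical, the sign matrix is stored explicitly for readability (a real streaming implementation would generate signs from 4-wise independent hash functions instead), and each update costs O(d) here, so it is the dense variant and not the O(1)-update one cited on the slide.

```python
import random


class F2Sketch:
    def __init__(self, n, d, seed=0):
        rng = random.Random(seed)
        # d rows of n random signs; explicit storage costs O(n*d) memory and is
        # used only to keep the example short.
        self.signs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(d)]
        self.counters = [0] * d

    def update(self, i, delta=1):
        # Adds delta * (column i of the sign matrix) to the sketch.
        for r, row in enumerate(self.signs):
            self.counters[r] += delta * row[i]

    def minus(self, other):
        """Sketch of (data 1 - data 2); both sketches must share n, d and seed."""
        diff = F2Sketch.__new__(F2Sketch)
        diff.signs = self.signs
        diff.counters = [a - b for a, b in zip(self.counters, other.counters)]
        return diff

    def estimate(self):
        # Each squared counter has expectation F_2; average them.
        return sum(c * c for c in self.counters) / len(self.counters)
```

Two routers that agree on `n`, `d` and `seed` can each maintain their own sketch; `a.minus(b).estimate()` then approximates the F_2 moment of the difference of their frequency vectors without either router seeing the other's stream.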