Sublinear Algorithms for Big Data
Lecture 2
Grigory Yaroslavtsev
http://grigory.us
Recap
• (Markov) For every c > 0:
  Pr[X ≥ c·𝔼[X]] ≤ 1/c
• (Chebyshev) For every c > 0:
  Pr[|X − 𝔼[X]| ≥ c·𝔼[X]] ≤ Var[X] / (c·𝔼[X])²
• (Chernoff) Let X_1, …, X_t be independent and identically distributed r.v.s with range [0, c] and expectation μ. Then if X = (1/t) Σ_i X_i and 1 > δ > 0:
  Pr[|X − μ| ≥ δμ] ≤ 2 exp(−tμδ²/(3c))
Today
• Approximate Median
• Alon-Matias-Szegedy Sampling
• Frequency Moments
• Distinct Elements
• Count-Min
Data Streams
• Stream: m elements from universe [n] = {1, 2, …, n}, e.g.
  ⟨x_1, x_2, …, x_m⟩ = ⟨5, 8, 1, 1, 1, 4, 3, 5, …, 10⟩
• f_i = frequency of i in the stream = # of occurrences of value i
  f = ⟨f_1, …, f_n⟩
Approximate Median
• S = {x_1, …, x_m} (all distinct) and let
  rank(y) = |{x ∈ S : x ≤ y}|
• Problem: Find an ε-approximate median, i.e. y such that
  m/2 − εm < rank(y) < m/2 + εm
• Exercise: Can we approximate the value of the median with additive error ±εn in sublinear time?
• Algorithm: Return the median of a sample of size t taken from S (with replacement).
Approximate Median
• Problem: Find an ε-approximate median, i.e. y such that
  m/2 − εm < rank(y) < m/2 + εm
• Algorithm: Return the median of a sample of size t taken from S (with replacement).
• Claim: If t = (7/ε²) log(2/δ) then this algorithm gives an ε-approximate median with probability 1 − δ (see the Python sketch below).
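
The sampling algorithm is short enough to state in code. Below is a minimal Python sketch; the sample size follows the claim's t = (7/ε²) log(2/δ) (the constant 7 is taken from the slide, not tuned), and the function name approximate_median is illustrative.

```python
import math
import random

def approximate_median(S, eps, delta):
    """eps-approximate median of S via a sample taken with replacement."""
    # Sample size from the claim: t = (7 / eps^2) * log(2 / delta).
    t = math.ceil(7 / eps**2 * math.log(2 / delta))
    sample = sorted(random.choice(S) for _ in range(t))
    return sample[len(sample) // 2]  # median of the sample

# Sanity check: rank(y) should be within m/2 +/- eps*m w.p. >= 1 - delta.
S = list(range(1, 100001))
y = approximate_median(S, eps=0.05, delta=0.01)
assert abs(sum(1 for x in S if x <= y) - len(S) / 2) < 0.05 * len(S)
```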
Approximate Median
• Partition S into 3 groups:
  S_L = {x ∈ S : rank(x) ≤ m/2 − εm}
  S_M = {x ∈ S : m/2 − εm ≤ rank(x) ≤ m/2 + εm}
  S_U = {x ∈ S : rank(x) ≥ m/2 + εm}
• Key fact: If fewer than t/2 elements from each of S_L and S_U are in the sample, then its median is in S_M.
• Let X_i = 1 if the i-th sample is in S_L and 0 otherwise.
• Let X = Σ_i X_i. Note 𝔼[X] ≤ (1/2 − ε)t, so (1 + ε)𝔼[X] ≤ t/2. By Chernoff, if t > (7/ε²) log(2/δ):
  Pr[X ≥ t/2] ≤ Pr[X ≥ (1 + ε)·𝔼[X]] ≤ e^{−ε²(1/2−ε)t/3} ≤ δ/2
• Same for S_U + union bound ⇒ error probability ≤ δ
AMS Sampling
• Problem: Estimate Σ_{i∈[n]} g(f_i), for an arbitrary function g with g(0) = 0.
• Estimator: Sample x_J, where J is sampled uniformly at random from [m], and compute:
  r = |{j ≥ J : x_j = x_J}|
  Output: X = m(g(r) − g(r − 1)) (see the Python sketch below)
• Expectation:
  𝔼[X] = Σ_i Pr[x_J = i]·𝔼[X | x_J = i]
       = Σ_i (f_i/m) · Σ_{r=1}^{f_i} m(g(r) − g(r − 1))/f_i
       = Σ_i g(f_i)    (the inner sum telescopes and g(0) = 0)
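
As a concrete illustration, here is a minimal single-pass Python sketch of this estimator. The uniformly random position J is maintained by reservoir sampling so that r can be counted in the same pass; the names ams_estimate and g are illustrative, not from the slides.

```python
import random

def ams_estimate(stream, g):
    """One sample of the AMS estimator for sum_i g(f_i), assuming g(0) = 0."""
    sampled, r, m = None, 0, 0
    for x in stream:
        m += 1
        if random.randrange(m) == 0:  # with probability 1/m: resample J = m
            sampled, r = x, 1
        elif x == sampled:
            r += 1                    # one more occurrence of x_J at or after J
    return m * (g(r) - g(r - 1))

# Averaging many independent samples estimates, e.g., F_2 (g(x) = x^2):
stream = [5, 8, 1, 1, 1, 4, 3, 5]
est = sum(ams_estimate(stream, lambda x: x * x) for _ in range(10000)) / 10000
```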
Frequency Moments
• Define F_k = Σ_i f_i^k for k ∈ {0, 1, 2, …}
  – F_0 = # of distinct elements
  – F_1 = # of elements
  – F_2 = "Gini index", "surprise index"
Frequency Moments
• Define F_k = Σ_i f_i^k for k ∈ {0, 1, 2, …}
• Use the AMS estimator with X = m(r^k − (r − 1)^k):
  𝔼[X] = F_k
• Exercise: 0 ≤ X ≤ m·k·f_*^{k−1}, where f_* = max_i f_i
• Repeat t times and take the average X̄. By Chernoff:
  Pr[|X̄ − F_k| ≥ εF_k] ≤ 2 exp(−t·F_k·ε² / (3m·k·f_*^{k−1}))
• Taking t = (3m·k·f_*^{k−1} / (ε²F_k)) log(1/δ) gives Pr[|X̄ − F_k| ≥ εF_k] ≤ δ
Frequency Moments
• Lemma: m·f_*^{k−1} / F_k ≤ n^{1−1/k}
• Result: t = (3m·k·f_*^{k−1} / (ε²F_k)) log(1/δ) = O((k·n^{1−1/k}/ε²) log(1/δ)) repetitions, each using O(log n) memory, suffice for an (ε, δ)-approximation of F_k
• Question: What if we don't know m?
• Then we can use probabilistic guessing (similar to Morris's algorithm), replacing log n with log(nm).
Frequency Moments
• Lemma: m·f_*^{k−1} / F_k ≤ n^{1−1/k}
• Exercise: F_k ≥ n(m/n)^k (Hint: the worst case is f_1 = ⋯ = f_n = m/n. Use convexity of g(x) = x^k.)
• Case 1: f_*^k ≤ n(m/n)^k, i.e. f_* ≤ n^{1/k}·(m/n). Then:
  m·f_*^{k−1} / F_k ≤ m·(n^{1/k}·(m/n))^{k−1} / (n(m/n)^k) = n^{(k−1)/k} = n^{1−1/k}
1−
Frequency Moments
• Lemma: m·f_*^{k−1} / F_k ≤ n^{1−1/k}
• Case 2: f_*^k ≥ n(m/n)^k, i.e. f_* ≥ n^{1/k}·(m/n). Then, since F_k ≥ f_*^k:
  m·f_*^{k−1} / F_k ≤ m·f_*^{k−1} / f_*^k = m / f_* ≤ m / (n^{1/k}·(m/n)) = n^{1−1/k}
Hash Functions
• Definition: A family H of functions from A → B is k-wise independent if for any distinct x_1, …, x_k ∈ A and any i_1, …, i_k ∈ B:
  Pr_{h ∈_R H}[h(x_1) = i_1, h(x_2) = i_2, …, h(x_k) = i_k] = 1/|B|^k
• Example: If A ⊆ {0, …, p − 1} and B = {0, …, p − 1} for a prime p, then
  H = {h(x) = Σ_{i=0}^{k−1} a_i x^i mod p : 0 ≤ a_0, a_1, …, a_{k−1} ≤ p − 1}
  is a k-wise independent family of hash functions (see the code sketch below).
Linear Sketches
• Sketching algorithm: pick a random matrix Z ∈ ℝ^{k×n}, where k ≪ n, and compute Zf.
• Can be incrementally updated:
  – We have a sketch Zf
  – When i arrives, the new frequencies are f′ = f + e_i
  – Updating the sketch:
    Zf′ = Z(f + e_i) = Zf + Ze_i = Zf + (i-th column of Z)
• Need to choose the random matrices carefully
F_2
• Problem: (ε, δ)-approximation for F_2 = Σ_i f_i²
• Algorithm:
  – Let Z ∈ {−1, 1}^{k×n}, where the entries of each row are 4-wise independent and the rows are independent
  – Don't store the matrix: keep k 4-wise independent hash functions σ
  – Compute Zf and average the squared entries "appropriately" (see the code sketch below)
• Analysis:
  – Let s be any entry of Zf.
  – Lemma: 𝔼[s²] = F_2
  – Lemma: Var[s²] ≤ 2F_2²
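
A minimal sketch of this algorithm in Python, reusing make_kwise_hash from the hash-functions slide. Two shortcuts to flag: the ±1 sign is derived as h(i) mod 2, which is very slightly biased because p is odd, and plain averaging of the squared entries stands in for the "appropriate" averaging (full AMS uses median-of-means to get an (ε, δ)-guarantee).

```python
import statistics

P = 2**31 - 1  # prime modulus for the 4-wise independent hashes

class F2Sketch:
    """Tug-of-war sketch: k counters s_j = sum_i sigma_j(i) * f_i = (Zf)_j."""
    def __init__(self, k):
        self.hashes = [make_kwise_hash(4, P) for _ in range(k)]
        self.s = [0] * k

    def update(self, i, delta=1):
        """Element i arrives (f_i grows by delta): add the i-th column of Z."""
        for j, h in enumerate(self.hashes):
            sigma = 1 if h(i) % 2 == 0 else -1  # pseudo-random sign
            self.s[j] += sigma * delta

    def estimate(self):
        return statistics.mean(v * v for v in self.s)  # E[s^2] = F_2
```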
F_2: Expectation
• Let σ be a row of Z with entries σ_i ∈_R {−1, 1}.
  𝔼[s²] = 𝔼[(Σ_{i=1}^n σ_i f_i)²]
        = 𝔼[Σ_{i=1}^n σ_i² f_i² + Σ_{i≠j} σ_i σ_j f_i f_j]
        = Σ_{i=1}^n f_i² + Σ_{i≠j} 𝔼[σ_i σ_j] f_i f_j
        = F_2 + Σ_{i≠j} 𝔼[σ_i] 𝔼[σ_j] f_i f_j = F_2
• We used 2-wise independence for 𝔼[σ_i σ_j] = 𝔼[σ_i] 𝔼[σ_j].
F_2: Variance
  Var[s²] = 𝔼[(s² − 𝔼[s²])²] = 𝔼[(Σ_{i≠j} σ_i σ_j f_i f_j)²]
          = 𝔼[2 Σ_{i≠j} σ_i² σ_j² f_i² f_j²] + 𝔼[4 Σ_{i≠j≠k} σ_i² σ_j σ_k f_i² f_j f_k] + (terms with four distinct indices)
          = 2 Σ_{i≠j} f_i² f_j² ≤ 2F_2²
  since, by 4-wise independence, the terms with three or four distinct indices have expectation 0.
F_0: Distinct Elements
• Problem: (ε, δ)-approximation for F_0 = Σ_i f_i^0, the number of distinct elements (with the convention 0^0 = 0)
• Simplified: For a fixed T > 0, with probability 1 − δ distinguish:
  F_0 > (1 + ε)T vs. F_0 < (1 − ε)T
• The original problem reduces to this by trying O(log n / ε) values of T:
  T = 1, (1 + ε), (1 + ε)², …, n
𝐹0 : Distinct Elements
• Simplified: For a fixed T > 0, with probability 1 − δ distinguish:
  F_0 > (1 + ε)T vs. F_0 < (1 − ε)T
• Algorithm:
  – Choose random sets S_1, …, S_k ⊆ [n] where Pr[i ∈ S_j] = 1/T
  – Compute s_j = Σ_{i∈S_j} f_i
  – If at least k/e of the values s_j are zero, output F_0 < (1 − ε)T; otherwise output F_0 > (1 + ε)T
F_0 > (1 + ε)T vs. F_0 < (1 − ε)T
• Algorithm:
  – Choose random sets S_1, …, S_k ⊆ [n] where Pr[i ∈ S_j] = 1/T
  – Compute s_j = Σ_{i∈S_j} f_i
  – If at least k/e of the values s_j are zero, output F_0 < (1 − ε)T
• Analysis:
  – If F_0 > (1 + ε)T, then Pr[s_j = 0] < 1/e − ε/3
  – If F_0 < (1 − ε)T, then Pr[s_j = 0] > 1/e + ε/3
  – Chernoff: k = O((1/ε²) log(1/δ)) gives correctness w.p. 1 − δ
F_0 > (1 + ε)T vs. F_0 < (1 − ε)T
• Analysis (a Python sketch of the distinguisher follows):
  – If F_0 > (1 + ε)T, then Pr[s_j = 0] < 1/e − ε/3
  – If F_0 < (1 − ε)T, then Pr[s_j = 0] > 1/e + ε/3
• If T is large and ε is small then:
  Pr[s_j = 0] = (1 − 1/T)^{F_0} ≈ e^{−F_0/T}
• If F_0 > (1 + ε)T:
  e^{−F_0/T} ≤ e^{−(1+ε)} ≤ 1/e − ε/3
• If F_0 < (1 − ε)T:
  e^{−F_0/T} ≥ e^{−(1−ε)} ≥ 1/e + ε/3
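
A simulation-style Python sketch of this distinguisher. Membership i ∈ S_j is decided by a PRNG seeded on (j, i), so the sets are never stored explicitly; the constant in k = O(ε⁻² log(1/δ)) is a placeholder, not taken from the slides.

```python
import math
import random

def f0_threshold_test(stream, T, eps, delta):
    """Distinguish F_0 > (1+eps)T from F_0 < (1-eps)T w.p. >= 1 - delta."""
    k = math.ceil(12 / eps**2 * math.log(1 / delta))  # constant is a guess
    s = [0] * k
    for x in stream:
        for j in range(k):
            # Pr[x in S_j] = 1/T, decided consistently via a seeded PRNG.
            if random.Random(f"{j}:{x}").random() < 1.0 / T:
                s[j] += 1
    zeros = sum(1 for v in s if v == 0)
    return "F0 < (1-eps)T" if zeros >= k / math.e else "F0 > (1+eps)T"
```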
Count-Min Sketch
• https://sites.google.com/site/countminsketch/
• Stream: m elements from universe [n] = {1, 2, …, n}, e.g.
  ⟨x_1, x_2, …, x_m⟩ = ⟨5, 8, 1, 1, 1, 4, 3, 5, …, 10⟩
• f_i = frequency of i in the stream = # of occurrences of value i, f = ⟨f_1, …, f_n⟩
• Problems:
  – Point Query: For i ∈ [n] estimate f_i
  – Range Query: For i, j ∈ [n] estimate f_i + ⋯ + f_j
  – Quantile Query: For φ ∈ [0, 1] find j with f_1 + ⋯ + f_j ≈ φm
  – Heavy Hitters: For φ ∈ [0, 1] find all i with f_i ≥ φm
Count-Min Sketch: Construction
• Let H_1, …, H_d : [n] → [w] be 2-wise independent hash functions
• We maintain d·w counters with values:
  c_{i,j} = # of elements e in the stream with H_i(e) = j
• For every x the value c_{i,H_i(x)} ≥ f_x, and so:
  f_x ≤ f̂_x = min(c_{1,H_1(x)}, …, c_{d,H_d(x)})
• If w = 2/ε and d = log₂(1/δ) then (implementation sketch below):
  Pr[f_x ≤ f̂_x ≤ f_x + εm] ≥ 1 − δ
Count-Min Sketch: Analysis
• Define random variables Z_1, …, Z_d such that c_{i,H_i(x)} = f_x + Z_i, where
  Z_i = Σ_{y≠x : H_i(y)=H_i(x)} f_y
• Define X_{i,y} = 1 if H_i(y) = H_i(x) and 0 otherwise:
  Z_i = Σ_{y≠x} f_y X_{i,y}
• By 2-wise independence:
  𝔼[Z_i] = Σ_{y≠x} f_y 𝔼[X_{i,y}] = Σ_{y≠x} f_y Pr[H_i(y) = H_i(x)] ≤ m/w
• By Markov's inequality (using w = 2/ε):
  Pr[Z_i ≥ εm] ≤ m/(w·εm) = 1/(wε) = 1/2
Count-Min Sketch: Analysis
• All Z_i are independent, so
  Pr[Z_i ≥ εm for all 1 ≤ i ≤ d] ≤ (1/2)^d = δ
• Then w.p. 1 − δ there exists j such that Z_j ≤ εm, and hence
  f̂_x = min(c_{1,H_1(x)}, …, c_{d,H_d(x)}) = min(f_x + Z_1, …, f_x + Z_d) ≤ f_x + εm
• Count-Min estimates every value f_x up to an additive εm with total memory O((1/ε) log(1/δ) log m)
Dyadic Intervals
• Define log n partitions of [n]:
  I_0 = {{1}, {2}, {3}, …, {n}}
  I_1 = {{1,2}, {3,4}, …, {n−1,n}}
  I_2 = {{1,2,3,4}, {5,6,7,8}, …, {n−3, n−2, n−1, n}}
  …
  I_{log n} = {{1, 2, 3, …, n}}
• Exercise: Any interval [i, j] can be written as a disjoint union of at most 2 log n such intervals.
• Example: For n = 256 (greedy decomposition code below):
  [48, 107] = [48,48] ∪ [49,64] ∪ [65,96] ∪ [97,104] ∪ [105,106] ∪ [107,107]
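
The exercise has a simple greedy solution: repeatedly peel off the largest dyadic block that starts at the left endpoint and still fits. A Python sketch (function name illustrative):

```python
def dyadic_decompose(i, j):
    """Split [i, j] (1-based, inclusive) into at most 2*log2(n) dyadic intervals."""
    out = []
    while i <= j:
        size = 1
        # Grow the block while it stays aligned and inside [i, j].
        while (i - 1) % (2 * size) == 0 and i + 2 * size - 1 <= j:
            size *= 2
        out.append((i, i + size - 1))
        i += size
    return out

# Reproduces the slide's example for n = 256:
# dyadic_decompose(48, 107) ==
#     [(48, 48), (49, 64), (65, 96), (97, 104), (105, 106), (107, 107)]
```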
Count-Min: Range Queries and Quantiles
• Range Query: For i, j ∈ [n] estimate f_i + ⋯ + f_j
• Approximate median: Find j such that:
  f_1 + ⋯ + f_j ≥ m/2 − εm and
  f_1 + ⋯ + f_{j−1} ≤ m/2 + εm
Count-Min: Range Queries and Quantiles
• Algorithm: Construct log n Count-Min sketches, one for each I_i, such that for any I ∈ I_i we have an estimate f̂_I of f_I = Σ_{l∈I} f_l with:
  Pr[f_I ≤ f̂_I ≤ f_I + εm] ≥ 1 − δ
• To estimate [i, j], let I_1, …, I_k be its dyadic decomposition:
  f̂_{[i,j]} = f̂_{I_1} + ⋯ + f̂_{I_k}
• Hence (code sketch below):
  Pr[f_{[i,j]} ≤ f̂_{[i,j]} ≤ f_{[i,j]} + 2εm log n] ≥ 1 − 2δ log n
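
Combining the CountMin and dyadic_decompose sketches from above gives a range-query structure: one Count-Min sketch per dyadic level, with the range estimate summed over the decomposition (RangeCountMin is an illustrative name; n is assumed to be a power of 2).

```python
import math

class RangeCountMin:
    """One Count-Min sketch per dyadic level; answers range queries."""
    def __init__(self, n, eps, delta):
        self.n = n
        self.levels = int(math.log2(n))
        self.sketches = [CountMin(eps, delta) for _ in range(self.levels + 1)]

    def update(self, x):
        for l in range(self.levels + 1):
            self.sketches[l].update((x - 1) >> l)  # index of x's level-l block

    def range_query(self, i, j):
        total = 0
        for a, b in dyadic_decompose(i, j):
            l = (b - a + 1).bit_length() - 1  # level = log2(block length)
            total += self.sketches[l].query((a - 1) >> l)
        return total  # overestimates by at most 2*eps*m*log(n) w.h.p.
```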
Count-Min: Heavy Hitters
• Heavy Hitters: For φ ∈ [0, 1] find all i with f_i ≥ φm, but no element with f_i ≤ (φ − ε)m
• Algorithm:
  – Consider the binary tree whose leaves are [n] and associate internal nodes with the intervals corresponding to their descendant leaves
  – Compute Count-Min sketches for each I_i
  – Level by level from the root, mark children I of marked nodes if f̂_I ≥ φm
  – Return all marked leaves
• Finds heavy hitters in O(φ^{−1} log n) steps (descent sketched in code below)
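
A sketch of this descent on top of the RangeCountMin structure above; since every node of the binary tree is a dyadic interval, its estimated weight is a single range_query call.

```python
def heavy_hitters(rcm, m, phi):
    """Candidates i with f_i >= phi*m, via root-to-leaf descent."""
    out, stack = [], [(1, rcm.n)]
    while stack:
        a, b = stack.pop()
        if rcm.range_query(a, b) < phi * m:
            continue  # unmarked: no heavy leaf below this node
        if a == b:
            out.append(a)  # marked leaf = reported heavy hitter
        else:
            mid = (a + b) // 2
            stack.extend([(a, mid), (mid + 1, b)])
    return out
```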
Thank you!
• Questions?