Sublinear Algorithms for Big Data
Lecture 2
Grigory Yaroslavtsev
http://grigory.us

Recap
• (Markov) For every a > 0:
  Pr[X ≥ a·E[X]] ≤ 1/a
• (Chebyshev) For every a > 0:
  Pr[|X − E[X]| ≥ a·E[X]] ≤ Var[X] / (a·E[X])²
• (Chernoff) Let X₁, …, X_t be independent and identically distributed r.v.s with range [0, c] and expectation μ. Then if X = (1/t)·Σᵢ Xᵢ and 1 > δ > 0,
  Pr[|X − μ| ≥ δμ] ≤ 2·exp(−t·μ·δ² / (3c))

Today
• Approximate Median
• Alon-Matias-Szegedy Sampling
• Frequency Moments
• Distinct Elements
• Count-Min

Data Streams
• Stream: m elements from universe [n] = {1, 2, …, n}, e.g.
  ⟨x₁, x₂, …, x_m⟩ = ⟨5, 8, 1, 1, 1, 4, 3, 5, …, 10⟩
• fᵢ = frequency of i in the stream = # of occurrences of value i; f = ⟨f₁, …, f_n⟩

Approximate Median
• S = {x₁, …, x_n} (all distinct); let rank(y) = |{x ∈ S : x ≤ y}|
• Problem: Find an ε-approximate median, i.e. y such that
  n/2 − εn < rank(y) < n/2 + εn
• Exercise: Can we approximate the value of the median with additive error ±εn in sublinear time?
• Algorithm: Return the median of a sample of size t taken from S (with replacement).
• Claim: If t = (7/ε²)·log(2/δ), then this algorithm returns an ε-approximate median with probability 1 − δ.

Approximate Median
• Partition S into 3 groups:
  G_L = {x ∈ S : rank(x) ≤ n/2 − εn}
  G_M = {x ∈ S : n/2 − εn < rank(x) < n/2 + εn}
  G_R = {x ∈ S : rank(x) ≥ n/2 + εn}
• Key fact: if fewer than t/2 elements from each of G_L and G_R appear in the sample, then its median is in G_M.
• Let Xᵢ = 1 if the i-th sample is in G_L and 0 otherwise.
• Let X = Σᵢ Xᵢ, so E[X] = (1/2 − ε)·t. By Chernoff, if t > (7/ε²)·log(2/δ):
  Pr[X ≥ t/2] ≤ Pr[X ≥ (1 + ε)·E[X]] ≤ exp(−ε²·(1/2 − ε)·t / 3) ≤ δ/2
• The same bound holds for G_R; a union bound gives error probability ≤ δ.

AMS Sampling
• Problem: Estimate Σ_{i∈[n]} g(fᵢ), for an arbitrary function g with g(0) = 0.
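Stepping back to the approximate-median algorithm for a moment, it can be simulated directly from its description (a minimal illustration, not from the slides; the function name and the test data are made up):

```python
import math
import random

def approximate_median(S, eps, delta):
    """Median of t = (7/eps^2) * log(2/delta) samples drawn from S with
    replacement; an eps-approximate median w.p. >= 1 - delta."""
    t = math.ceil(7 / eps**2 * math.log(2 / delta))
    sample = sorted(random.choice(S) for _ in range(t))
    return sample[len(sample) // 2]

random.seed(0)
S = list(range(1, 10001))                       # n = 10000 distinct values
y = approximate_median(S, eps=0.05, delta=0.01)
rank = sum(1 for x in S if x <= y)              # should be 5000 +- 500
```

Note that only the sample is sorted, so the running time is sublinear in n (apart from generating the test data).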
• Estimator: Sample x_J, where J is drawn uniformly at random from [m], and compute
  r = |{j ≥ J : x_j = x_J}|
  Output: X = m·(g(r) − g(r − 1))
• Expectation:
  E[X] = Σ_{i=1}^{n} Pr[x_J = i]·E[X | x_J = i]
       = Σ_{i=1}^{n} (fᵢ/m)·(m/fᵢ)·Σ_{r=1}^{fᵢ} (g(r) − g(r − 1))
       = Σ_{i=1}^{n} g(fᵢ)
  (the inner sum telescopes to g(fᵢ) since g(0) = 0)

Frequency Moments
• Define F_k = Σᵢ fᵢᵏ for k ∈ {0, 1, 2, …}
  – F₀ = # of distinct elements
  – F₁ = # of elements
  – F₂ = "Gini index", "surprise index"

Frequency Moments
• Use the AMS estimator with X = m·(rᵏ − (r − 1)ᵏ), so that E[X] = F_k.
• Exercise: 0 ≤ X ≤ m·k·f*^{k−1}, where f* = maxᵢ fᵢ.
• Repeat t times and take the average X̄. By Chernoff:
  Pr[|X̄ − F_k| ≥ ε·F_k] ≤ 2·exp(−t·F_k·ε² / (3·m·k·f*^{k−1}))
• Taking t = (3·m·k·f*^{k−1} / (ε²·F_k))·log(2/δ) gives Pr[|X̄ − F_k| ≥ ε·F_k] ≤ δ.

Frequency Moments
• Lemma: m·f*^{k−1} ≤ n^{1−1/k}·F_k
• Result: t = (3·m·k·f*^{k−1} / (ε²·F_k))·log(2/δ) = O(k·n^{1−1/k}·(1/ε²)·log(1/δ)) memory suffices for an (ε, δ)-approximation of F_k.
• Question: What if we don't know m?
• Then we can use probabilistic guessing (similar to Morris's algorithm), replacing log m with log(nm).

Frequency Moments
• Lemma: m·f*^{k−1} ≤ n^{1−1/k}·F_k
• Exercise: F_k ≥ mᵏ / n^{k−1} (Hint: the worst case is f₁ = ⋯ = f_n = m/n; use convexity of g(x) = xᵏ).
• Case 1: f*ᵏ ≤ mᵏ/n^{k−1}, i.e. f* ≤ m/n^{1−1/k}. Then by the exercise:
  m·f*^{k−1}/F_k ≤ m·(m/n^{1−1/k})^{k−1} / (mᵏ/n^{k−1}) = n^{(k−1) − (k−1)²/k} = n^{1−1/k}
• Case 2: f*ᵏ ≥ mᵏ/n^{k−1}, i.e. f* ≥ m/n^{1−1/k}. Then since F_k ≥ f*ᵏ:
  m·f*^{k−1}/F_k ≤ m·f*^{k−1}/f*ᵏ = m/f* ≤ n^{1−1/k}

Hash Functions
• Definition: A family H of functions from A → B is k-wise independent if for any distinct x₁, …, x_k ∈ A and any i₁, …, i_k ∈ B:
  Pr_{h∈H}[h(x₁) = i₁, h(x₂) = i₂, …, h(x_k) = i_k] = 1/|B|ᵏ
• Example: If A ⊆ {0, …, p − 1} and B = {0, …, p − 1} for a prime p, then
  H = { h(x) = Σ_{i=0}^{k−1} aᵢ·xⁱ mod p : 0 ≤ a₀, a₁, …, a_{k−1} ≤ p − 1 }
  is a k-wise independent family of hash functions.

Linear Sketches
• Sketching algorithm: pick a random matrix Z ∈ ℝ^{k×n}, where k ≪ n, and compute Zf.
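As an aside, the AMS estimator above can be prototyped directly from its definition (an illustrative sketch, not from the slides; in a true one-pass setting J would be chosen by reservoir sampling and r maintained as a running counter, while here the stream is stored for clarity):

```python
import random

def ams_estimate(stream, g, trials):
    """Average of `trials` independent copies of the AMS estimator
    X = m * (g(r) - g(r - 1)) for sum_i g(f_i), assuming g(0) = 0."""
    m = len(stream)
    total = 0.0
    for _ in range(trials):
        J = random.randrange(m)  # uniform random position in the stream
        # r = number of occurrences of x_J at or after position J
        r = sum(1 for j in range(J, m) if stream[j] == stream[J])
        total += m * (g(r) - g(r - 1))
    return total / trials

random.seed(1)
stream = [5, 8, 1, 1, 1, 4, 3, 5, 3, 1]  # F_2 = 4 + 1 + 16 + 1 + 4 = 26
est = ams_estimate(stream, g=lambda x: x**2, trials=2000)
```

With g(x) = xᵏ this is exactly the frequency-moment estimator; averaging many copies tightens the concentration, as in the Chernoff argument above.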
• The sketch can be updated incrementally:
  – We have a sketch Zf
  – When element i arrives, the new frequency vector is f′ = f + eᵢ
  – Updating the sketch: Zf′ = Z(f + eᵢ) = Zf + Z·eᵢ = Zf + (i-th column of Z)
• Need to choose the random matrices carefully

F₂
• Problem: (ε, δ)-approximation for F₂ = Σᵢ fᵢ²
• Algorithm:
  – Let Z ∈ {−1, 1}^{k×n}, where the entries of each row are 4-wise independent and the rows are independent
  – Don't store the matrix: k 4-wise independent hash functions suffice
  – Compute Zf and average the squared entries "appropriately"
• Analysis:
  – Let Y be any entry of Zf
  – Lemma: E[Y²] = F₂
  – Lemma: Var[Y²] ≤ 2F₂²

F₂: Expectation
• Let σ be a row of Z, with entries σᵢ ∈ {−1, 1}:
  E[Y²] = E[(Σ_{i=1}^{n} σᵢfᵢ)²]
        = E[Σ_{i=1}^{n} σᵢ²fᵢ²] + Σ_{i≠j} E[σᵢσⱼ]·fᵢfⱼ
        = F₂ + Σ_{i≠j} E[σᵢ]·E[σⱼ]·fᵢfⱼ = F₂
• We used σᵢ² = 1, E[σᵢ] = 0 and 2-wise independence for E[σᵢσⱼ] = E[σᵢ]·E[σⱼ].

F₂: Variance
• Since Y² − E[Y²] = Σ_{i≠j} σᵢσⱼfᵢfⱼ:
  Var[Y²] = E[(Σ_{i≠j} σᵢσⱼfᵢfⱼ)²]
• Expanding the square, 4-wise independence makes every term vanish except those in which the σ's pair up:
  Var[Y²] = 2·Σ_{i≠j} fᵢ²fⱼ² ≤ 2F₂²

F₀: Distinct Elements
• Problem: (ε, δ)-approximation for F₀ = Σᵢ fᵢ⁰ = # of distinct elements
• Simplified: for a fixed threshold T > 0, with prob. 1 − δ distinguish:
  F₀ > (1 + ε)T vs. F₀ < (1 − ε)T
• The original problem reduces to the simplified one by trying O((log n)/ε) values of T:
  T = 1, (1 + ε), (1 + ε)², …, n

F₀: Distinct Elements
• Algorithm:
  – Choose random sets S₁, …, S_k ⊆ [n], where Pr[i ∈ Sⱼ] = 1/T
  – Compute Yⱼ = Σ_{i∈Sⱼ} fᵢ
  – If at least k/e of the values Yⱼ are zero, output "F₀ < (1 − ε)T"; otherwise output "F₀ > (1 + ε)T"
• Analysis:
  – If T is large and ε is small then Pr[Yⱼ = 0] = (1 − 1/T)^{F₀} ≈ e^{−F₀/T}
  – If F₀ > (1 + ε)T, then Pr[Yⱼ = 0] ≈ e^{−F₀/T} ≤ e^{−(1+ε)} ≤ 1/e − ε/3
  – If F₀ < (1 − ε)T, then Pr[Yⱼ = 0] ≈ e^{−F₀/T} ≥ e^{−(1−ε)} ≥ 1/e + ε/3
  – Chernoff: k = O((1/ε²)·log(1/δ)) repetitions give correctness w.p. 1 − δ

Count-Min Sketch
• https://sites.google.com/site/countminsketch/
• Stream: m elements from universe [n] = {1, 2, …, n}, e.g.
  ⟨x₁, x₂, …, x_m⟩ = ⟨5, 8, 1, 1, 1, 4, 3, 5, …, 10⟩
• fᵢ = frequency of i in the stream = # of occurrences of value i; f = ⟨f₁, …, f_n⟩
• Problems:
  – Point Query: for i ∈ [n], estimate fᵢ
  – Range Query: for i, j ∈ [n], estimate fᵢ + ⋯ + fⱼ
  – Quantile Query: for φ ∈ [0, 1], find j with f₁ + ⋯ + fⱼ ≈ φm
  – Heavy Hitters: for φ ∈ [0, 1], find all i with fᵢ ≥ φm

Count-Min Sketch: Construction
• Let H₁, …, H_d : [n] → [w] be 2-wise independent hash functions
• We maintain d·w counters with values:
  c_{i,j} = # of elements x in the stream with Hᵢ(x) = j
• For every x we have c_{i,Hᵢ(x)} ≥ f_x, and so:
  f_x ≤ f̂_x = min(c_{1,H₁(x)}, …, c_{d,H_d(x)})
• If w = 2/ε and d = log₂(1/δ), then:
  Pr[f_x ≤ f̂_x ≤ f_x + εm] ≥ 1 − δ

Count-Min Sketch: Analysis
• Define random variables B₁, …, B_d such that c_{i,Hᵢ(x)} = f_x + Bᵢ, where:
  Bᵢ = Σ_{y≠x : Hᵢ(y)=Hᵢ(x)} f_y
• Define X_{i,y} = 1 if Hᵢ(y) = Hᵢ(x) and 0 otherwise, so that Bᵢ = Σ_{y≠x} f_y·X_{i,y}
• By 2-wise independence:
  E[Bᵢ] = Σ_{y≠x} f_y·E[X_{i,y}] = Σ_{y≠x} f_y·Pr[Hᵢ(y) = Hᵢ(x)] ≤ m/w
• By Markov's inequality:
  Pr[Bᵢ ≥ εm] ≤ E[Bᵢ]/(εm) ≤ 1/(wε) = 1/2
• The Bᵢ are independent (the hash functions are chosen independently), so:
  Pr[Bᵢ ≥ εm for all i] ≤ (1/2)^d = δ
• Thus w.p. 1 − δ there exists i such that Bᵢ ≤ εm, and:
  f̂_x = min(c_{1,H₁(x)}, …, c_{d,H_d(x)}) = min(f_x + B₁, …, f_x + B_d) ≤ f_x + εm
• Count-Min estimates the values f_x up to an additive error of εm with total memory O((1/ε)·log(1/δ)·log m)

Dyadic Intervals
• Define log n partitions of [n]:
  I₀ = {{1}, {2}, {3}, …, {n}}
  I₁ = {{1, 2}, {3, 4}, …, {n − 1, n}}
  I₂ = {{1, 2, 3, 4}, {5, 6, 7, 8}, …, {n − 3, n − 2, n − 1, n}}
  …
  I_{log n} = {{1, 2, 3, …, n}}
• Exercise: any interval [a, b] ⊆ [n] can be written as a disjoint union of at most 2 log n such intervals.
• Example: for n = 256:
  [48, 107] = [48, 48] ∪ [49, 64] ∪ [65, 96] ∪ [97, 104] ∪ [105, 106] ∪ [107, 107]

Count-Min: Range Queries and Quantiles
• Range Query: for a, b ∈ [n], estimate f_{[a,b]} = f_a + ⋯ + f_b
• Approximate median: find j such that:
  f₁ + ⋯ + fⱼ ≥ m/2 − εm and f₁ + ⋯ + f_{j−1} ≤ m/2 + εm
• Algorithm: construct log n Count-Min sketches, one for each I_j, so that for any dyadic interval I ∈ I_j we have an estimate f̂_I of f_I with:
  Pr[f_I ≤ f̂_I ≤ f_I + εm] ≥ 1 − δ
• To estimate f_{[a,b]}, let I₁, …, I_ℓ (ℓ ≤ 2 log n) be its dyadic decomposition and set:
  f̂_{[a,b]} = f̂_{I₁} + ⋯ + f̂_{I_ℓ}
• Hence:
  Pr[f_{[a,b]} ≤ f̂_{[a,b]} ≤ f_{[a,b]} + 2εm·log n] ≥ 1 − 2δ·log n

Count-Min: Heavy Hitters
• Heavy Hitters: for φ ∈ [0, 1], find all i with fᵢ ≥ φm, but no elements with fᵢ ≤ (φ − ε)m
• Algorithm:
  – Consider the binary tree whose leaves are [n] and associate each internal node with the interval of its descendant leaves
  – Compute a Count-Min sketch for each I_j
  – Level by level from the root, mark those children I of marked nodes that have f̂_I ≥ φm
  – Return all marked leaves
• Finds the heavy hitters in O(φ⁻¹·log n) steps

Thank you!
• Questions?
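To make the Count-Min construction concrete, here is a compact Python version for point queries (an illustrative sketch only; the class name is mine, and the (a·x + b) mod p hashes are just one standard way to get 2-wise independence):

```python
import math
import random

class CountMinSketch:
    """d = ceil(log2(1/delta)) rows of w = ceil(2/eps) counters; point
    queries satisfy f_x <= estimate(x) <= f_x + eps*m w.p. >= 1 - delta."""
    P = 2**61 - 1  # a prime much larger than the universe size

    def __init__(self, eps, delta):
        self.w = math.ceil(2 / eps)
        self.d = math.ceil(math.log2(1 / delta))
        self.counters = [[0] * self.w for _ in range(self.d)]
        # One 2-wise independent hash H_i(x) = ((a*x + b) mod P) mod w per row.
        self.hashes = [(random.randrange(1, self.P), random.randrange(self.P))
                       for _ in range(self.d)]

    def _bucket(self, i, x):
        a, b = self.hashes[i]
        return ((a * x + b) % self.P) % self.w

    def update(self, x):
        # Increment the counter c_{i, H_i(x)} in every row.
        for i in range(self.d):
            self.counters[i][self._bucket(i, x)] += 1

    def estimate(self, x):
        # f_hat_x = min over rows; never underestimates f_x.
        return min(self.counters[i][self._bucket(i, x)] for i in range(self.d))

random.seed(0)
cm = CountMinSketch(eps=0.1, delta=0.01)
for x in [5, 8, 1, 1, 1, 4, 3, 5, 3, 1]:  # f_1 = 4, f_5 = 2, m = 10
    cm.update(x)
```

Reducing the final mod P to mod w is only approximately uniform over the buckets, which is fine for an illustration; the range-query and heavy-hitter algorithms above would keep one such sketch per dyadic level.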