Leveraging Big Data: Lecture 2
http://www.cohenwang.com/edith/bigdataclass2013
Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo

Counting Distinct Elements
Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
• Elements occur multiple times; we want to count the number of distinct elements.
• The number of distinct elements is n (= 6 in the example).
• The total number of elements is 11 in this example.
Exact counting of n distinct elements requires a structure of size Ω(n)!
We are happy with an approximate count that uses a small working memory.

Distinct Elements: Approximate Counting
Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
We want to be able to compute and maintain a small sketch s(N) of the set N of distinct items seen so far: N = {32, 12, 14, 7, 6, 4}.

Distinct Elements: Approximate Counting
• Size of the sketch: |s(N)| ≪ n (= 6 in the example).
• We can query s(N) to get a good estimate n̂(s) of n (small relative error).
• For a new element x, it is easy to compute s(N ∪ {x}) from s(N) and x
  (needed for data stream computation).
• If N1 and N2 are (possibly overlapping) sets, then we can compute the union sketch from their sketches: s(N1 ∪ N2) from s(N1) and s(N2)
  (needed for distributed computation).

Distinct Elements: Approximate Counting
Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
Size-estimation / minimum-value technique [Flajolet-Martin 85, C 94]:
h(x) ∼ U[0,1] is a random hash function from element IDs to uniform random numbers in [0,1].
Maintain the Min-Hash value y:
• Initialize y ← 1.
• Processing an element x: y ← min{y, h(x)}.

Distinct Elements: Approximate Counting
x:    32   12   14   32   7    12   32   7    6    12   4
n:    1    2    3    3    4    4    4    4    5    5    6
h(x): 0.45 0.35 0.74 0.45 0.21 0.35 0.45 0.21 0.14 0.35 0.92
y:    0.45 0.35 0.35 0.35 0.21 0.21 0.21 0.21 0.14 0.14 0.14
(y is initialized to 1 before the first element; n counts the distinct elements seen so far.)
The minimum hash value y:
• is unaffected by repeated elements;
• is non-increasing with the number of distinct elements n.

Distinct Elements: Approximate Counting
How does the minimum hash y give information on the number of distinct elements n?
The expectation of the minimum is E[min over N of h(x)] = 1/(n+1).
A single value gives only limited information. To boost the information, we maintain k ≥ 1 values.

Why is the expectation 1/(n+1)?
• Take a circle of length 1 (the circle points map to [0,1]).
• Throw a random red point to "mark" the start of the unit interval.
• Throw another n points independently at random.
• The circle is cut into n+1 segments by these points.
• The expected length of each segment is 1/(n+1).
• The same holds for the segment clockwise from the red point, and its length is distributed exactly like the minimum of the n hash values.

Min-Hash Sketches
These sketches maintain k ≥ 1 values y_1, y_2, ..., y_k from the range of the hash function (distribution).
k-mins sketch: use k "independent" hash functions h_1, h_2, ..., h_k; track the respective minimum y_1, y_2, ..., y_k of each function.
Bottom-k sketch: use a single hash function h; track the k smallest values y_1, y_2, ..., y_k.
k-partition sketch: use a single hash function h'. Use the first log2(k) bits of h'(x) to map x uniformly to one of k parts, and call the remaining bits h(x). For i = 1, ..., k: track the minimum hash value y_i of the elements in part i.
All sketches are the same for k = 1.
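To make the three variants concrete, here is a minimal Python sketch of the update rules defined above. It is an illustration only, not code from the course: it assumes an idealized hash h(x) mapping element IDs to uniform values in [0,1), simulated here by salting SHA-256, and the k-partition variant simply uses two salted hashes for the part and the value instead of splitting the bits of a single h'(x).

```python
import hashlib

def h(x, salt=0):
    """Idealized hash: maps (salt, x) to a 'uniform' value in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{x}".encode()).hexdigest()
    return int(digest, 16) / 2**256

class KMins:
    """k-mins: k 'independent' hash functions, track the minimum of each."""
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k
    def add(self, x):
        for i in range(self.k):                      # O(k) per element
            self.y[i] = min(self.y[i], h(x, salt=i))

class BottomK:
    """bottom-k: a single hash function, track the k smallest hash values."""
    def __init__(self, k):
        self.k = k
        self.y = []                                  # kept sorted, length <= k
    def add(self, x):
        v = h(x)
        # Update only if the sketch is not yet full or v beats the current k-th value.
        if len(self.y) < self.k or v < self.y[-1]:
            if v not in self.y:                      # a repeated element has the same hash
                self.y = sorted(self.y + [v])[: self.k]

class KPartition:
    """k-partition: one hash picks a part, another gives the value; track the min per part.
    (Simplification: two salted hashes instead of splitting the bits of one hash.)"""
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k
    def add(self, x):
        i = int(h(x, salt="part") * self.k)          # the 'first bits': part index in 0..k-1
        self.y[i] = min(self.y[i], h(x, salt="value"))
```

Feeding the example stream 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4 into any of these classes yields a sketch that depends only on the set of distinct elements, not on their order or multiplicity.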
Min-Hash Sketches
k-mins, bottom-k, k-partition
Why study all 3 variants?
Different tradeoffs between update cost, accuracy, usage, ...
Beyond distinct counting:
• Min-Hash sketches correspond to sampling schemes of large data sets.
• Similarity queries between datasets.
• Selectivity/subset queries.
• These patterns generally apply as methods to gather increased confidence from a random "projection"/sample.

Min-Hash Sketches: Examples
k-mins, k-partition, bottom-k; k = 3
N = {32, 12, 14, 7, 6, 4}
The min-hash value and the sketches depend only on
• the random hash function(s),
• the set N of distinct elements,
and not on the order in which elements appear or on their multiplicity.

Min-Hash Sketches: Example k-mins (k = 3)
x:     32   12   14   7    6    4
h1(x): 0.45 0.35 0.74 0.21 0.14 0.92
h2(x): 0.19 0.51 0.07 0.70 0.55 0.20
h3(x): 0.10 0.71 0.93 0.50 0.89 0.18
(y1, y2, y3) = (0.14, 0.07, 0.10)

Min-Hash Sketches: k-mins
k-mins sketch: use k "independent" hash functions h1, h2, ..., hk and track the respective minimum y1, y2, ..., yk of each function.
Processing a new element x:
For i = 1, ..., k: y_i ← min{y_i, h_i(x)}.
Example: h1(x) = 0.35, h2(x) = 0.51, h3(x) = 0.71.
Computation: O(k) per element, whether the sketch is actually updated or not.

Min-Hash Sketches: Example k-partition (k = 3)
x:    32   12   14   7    6    4
i(x): 1    2    3    3    2    1      (part hash)
h(x): 0.07 0.70 0.55 0.20 0.19 0.51   (value hash)
(y1, y2, y3) = (0.07, 0.19, 0.20)

Min-Hash Sketches: k-partition
k-partition sketch: use a single hash function h'. Use the first log2(k) bits of h'(x) to map x uniformly to one of k parts, and call the remaining bits h(x). For i = 1, ..., k: track the minimum hash value y_i of the elements in part i.
Processing a new element x:
• i ← first log2(k) bits of h'(x)
• h ← remaining bits of h'(x)
• y_i ← min{y_i, h}
Example: i(x) = 2, h(x) = 0.19, so y_2 ← min{y_2, 0.19}.
Computation: O(1) to test or update.

Min-Hash Sketches: Example Bottom-k (k = 3)
x:    32   12   14   7    6    4
h(x): 0.19 0.51 0.07 0.70 0.55 0.20
(y1, y2, y3) = (0.07, 0.19, 0.20)

Min-Hash Sketches: bottom-k
Bottom-k sketch: use a single hash function h and track the k smallest values y1 < y2 < ... < yk.
Processing a new element x:
If h(x) < y_k: (y1, ..., yk) ← sort{y1, ..., y_{k-1}, h(x)}.
Computation: the sketch (y1, ..., yk) is maintained as a sorted list or as a priority queue.
• O(1) to test whether an update is needed.
• O(k) to update a sorted list; O(log k) to update a priority queue.
We will see that the number of changes ≪ the number of distinct elements.

Min-Hash Sketches: Number of updates
Claim: the expected number of actual updates (changes) of the min-hash sketch is O(k ln n).
Proof: first consider k = 1. Look at the distinct elements in the order they first occur. The i-th distinct element has a lower hash value than the current minimum with probability 1/i (this is the probability of being first in a random permutation of i elements).
⟹ The total expected number of updates is Σ_{i=1}^{n} 1/i = H_n ≤ 1 + ln n.

Element:      32  12   14   32  7    12  32  7   6    12  4
Update prob.: 1   1/2  1/3  0   1/4  0   0   0   1/5  0   1/6

Min-Hash Sketches: Number of updates
Claim: the expected number of actual updates (changes) of the min-hash sketch is O(k ln n).
Proof (continued): recap for k = 1 (a single min-hash value): the i-th distinct element causes an update with probability 1/i ⟹ the expected total is Σ_{i=1}^{n} 1/i ≤ 1 + ln n.
k-mins: k min-hash values; apply the k = 1 bound k times.
Bottom-k: we keep the k smallest values, so the update probability of the i-th distinct element is min{1, k/i} (the probability of being among the first k in a random permutation of i elements).
k-partition: k min-hash values, each over ≈ n/k distinct elements.

Merging Min-Hash Sketches
We apply the same set of hash functions to all elements / data sets / streams.
The union sketch s from the sketches s', s'' of two sets:
• k-mins: take the minimum per hash function: y_i ← min{y'_i, y''_i}.
• k-partition: take the minimum per part: y_i ← min{y'_i, y''_i}.
• Bottom-k: the k smallest values over the union of the data must be among the k smallest of their own set: {y1, ..., yk} = bottom-k{y'1, ..., y'k, y''1, ..., y''k}.
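As a small illustration of the merge rules, here is a sketch in Python (the helper names are mine). It assumes both inputs were built with the same hash functions and that each sketch is represented by its list of stored values, as in the toy classes above.

```python
def merge_kmins(y1, y2):
    """Union of two k-mins sketches: minimum per hash function."""
    return [min(a, b) for a, b in zip(y1, y2)]

def merge_kpartition(y1, y2):
    """Union of two k-partition sketches: minimum per part."""
    return [min(a, b) for a, b in zip(y1, y2)]

def merge_bottomk(y1, y2, k):
    """Union of two bottom-k sketches: the k smallest distinct hash values
    in the union are among the at most 2k stored values."""
    return sorted(set(y1) | set(y2))[:k]
```

Mergeability is exactly what makes the sketches usable for distributed computation: each machine sketches its own stream, and the sketches are combined afterwards.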
Using Min-Hash Sketches
Recap:
• We defined Min-Hash sketches (3 types).
• Adding elements, merging Min-Hash sketches.
• Some properties of these sketches.
Next: we put Min-Hash sketches to work.
• Estimating the distinct count from a Min-Hash sketch.
• Tools from estimation theory.

The Exponential Distribution Exp(n)
• PDF: n e^{-nx} for x ≥ 0; CDF: 1 - e^{-nx}; mean μ = standard deviation σ = 1/n.
• Very useful properties:
  - Memorylessness: for all t, y ≥ 0, Pr[x > y + t | x > y] = Pr[x > t].
  - Min-to-Sum conversion: min{Exp(n_1), ..., Exp(n_t)} ∼ Exp(n_1 + ... + n_t).
• Relation with the uniform distribution: if u ∼ U[0,1] then -ln(1-u)/n ∼ Exp(n); equivalently, x ∼ Exp(n) ⟺ 1 - e^{-nx} ∼ U[0,1].

Estimating Distinct Count from a Min-Hash Sketch: k-mins
• Change to an exponentially distributed hash: h(x) ∼ Exp(1).
• By the Min-to-Sum property, each minimum y_i ∼ Exp(n).
  - In fact, we can just work with h(x) ∼ U[0,1] and use y_i ← -ln(1 - y_i) when estimating.
• The number of distinct elements becomes a parameter estimation problem:
  Given k independent samples from Exp(n), estimate n.

Estimating Distinct Count from a Min-Hash Sketch: k-mins
• Each y_i ∼ Exp(n) has expectation 1/n and variance 1/n².
• The average ȳ = (1/k) Σ_{i=1}^{k} y_i has expectation μ = 1/n and variance σ² = 1/(k n²). The CV is σ/μ = 1/√k.
• ȳ is a good unbiased estimator, but for 1/n, which is the inverse of what we want.
What about estimating n?

Estimating Distinct Count from a Min-Hash Sketch: k-mins
What about estimating n?
1) We can use the biased estimator n̂ = k / Σ_{i=1}^{k} y_i = 1/ȳ.
   To say something useful about the estimate quality, we apply Chebyshev's inequality to bound the probability that ȳ is far from its expectation 1/n, and thus that 1/ȳ is far from n.
2) Maximum likelihood estimation (a general and powerful technique).

Chebyshev's Inequality
For any random variable with expectation μ and standard deviation σ, and for any c ≥ 1:
Pr[|x - μ| ≥ cσ] ≤ 1/c².
For ȳ: μ = 1/n, σ = 1/(√k · n).
Using c = ε√k/2 (for ε < 1/2):
Pr[|ȳ - 1/n| ≥ ε/(2n)] ≤ 4/(ε² k).

Using Chebyshev's Inequality
For 0 < ε < 1/2 we have 1 - ε/2 > 1/(1+ε) and 1/(1-ε) > 1 + ε > 1 + ε/2. Therefore
Pr[|1/ȳ - n| ≥ εn]
= 1 - Pr[-εn ≤ 1/ȳ - n ≤ εn]
= 1 - Pr[n(1-ε) ≤ 1/ȳ ≤ n(1+ε)]
= 1 - Pr[1/(n(1+ε)) ≤ ȳ ≤ 1/(n(1-ε))]
≤ 1 - Pr[(1/n)(1 - ε/2) ≤ ȳ ≤ (1/n)(1 + ε/2)]
= Pr[|ȳ - 1/n| ≥ ε/(2n)] ≤ 4/(ε² k).
So the biased estimator n̂ = 1/ȳ has relative error at least ε with probability at most 4/(ε² k).
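A quick Monte Carlo check of the biased estimator n̂ = k/Σ y_i and of the Chebyshev-style bound above. This is a sketch under the idealized-hash assumption: instead of hashing real elements, each of the k minima is drawn directly as an Exp(n) variable (equivalently, as -ln(1 - y'_i) for a uniform min-hash y'_i), and the parameter values are arbitrary.

```python
import random

def kmins_estimate(n, k, rng):
    """Draw the k min-hash values of a set of n distinct elements
    (the minimum of n Exp(1) hashes is Exp(n)) and return k / sum."""
    y = [rng.expovariate(n) for _ in range(k)]
    return k / sum(y)

rng = random.Random(1)
n, k, eps, trials = 10_000, 100, 0.25, 2_000
errors = [abs(kmins_estimate(n, k, rng) - n) / n for _ in range(trials)]
freq_bad = sum(e >= eps for e in errors) / trials
print(f"empirical Pr[rel. err >= {eps}] = {freq_bad:.3f}, "
      f"Chebyshev bound 4/(eps^2 k) = {4 / (eps**2 * k):.3f}")
```

The empirical frequency is typically far below the bound; Chebyshev is loose, but it already shows that the relative error shrinks like 1/√k.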
Maximum Likelihood Estimation
A set of independent y_i ∼ F_i(n); we do not know n.
The MLE n̂_MLE is the value of n that maximizes the likelihood (joint density) function f(y; n), i.e. the maximum over n of the probability (density) of observing {y_i}.
Properties:
• A principled way of deriving estimators.
• Converges in probability to the true value (with enough i.i.d. samples)... but is generally biased.
• (Asymptotically!) optimal: minimizes the MSE (mean square error), meets the Cramér-Rao lower bound.

Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE
Given k independent samples from Exp(n), estimate n.
• Likelihood function for y (joint density): f(y; n) = Π_{i=1}^{k} n e^{-n y_i} = n^k e^{-n Σ_{i=1}^{k} y_i}.
• Take a logarithm (does not change the maximizer): ℓ(y_1, ..., y_k; n) = ln f(y; n) = k ln n - n Σ_{i=1}^{k} y_i.
• Differentiate to find the maximum: ∂ℓ(y; n)/∂n = k/n - Σ_{i=1}^{k} y_i = 0.
• MLE estimate: n̂_MLE = k / Σ_{i=1}^{k} y_i.
We get the same estimator; it depends only on the sum!

Given k independent samples from Exp(n), estimate n.
We can think of several ways to combine and use these k samples and decrease the variance:
• average (sum)
• median
• remove outliers and average the remaining, ...
We want to get the most value (the best estimate) from the information we have (the sketch).
Which combinations should we consider?

Sufficient Statistic
A function T(y) = T(y_1, ..., y_k) is a sufficient statistic for estimating some function of the parameter n if the likelihood function has the factored form
f(y; n) = g(y) · h(T(y); n).
The likelihood function (joint density) of k i.i.d. random variables from Exp(n):
f(y; n) = Π_{i=1}^{k} n e^{-n y_i} = n^k e^{-n Σ_{i=1}^{k} y_i}
⟹ the sum Σ_{i=1}^{k} y_i is a sufficient statistic for n.

Sufficient Statistic
T(y) = T(y_1, ..., y_k) is a sufficient statistic for estimating some function of n if the likelihood function has the factored form f(y; n) = g(y) · h(T(y); n).
In particular, the MLE depends on y only through T(y):
• the maximum with respect to n does not depend on g(y);
• the maximum of h(T(y); n), computed by differentiating with respect to n, is a function of T(y).

Sufficient Statistic
T(y) = T(y_1, ..., y_k) is a sufficient statistic for n if the likelihood function has the form f(y; n) = g(y) · h(T(y); n).
Lemma: T(y) is sufficient ⟺ the conditional distribution of y given T(y) does not depend on n.
If we fix T(y), the density function satisfies f(y; n) ∝ g(y); when we know a density up to a fixed factor, it is determined completely by normalizing it to 1.

Rao-Blackwell Theorem
Recap: T(y) is a sufficient statistic for n ⟺ the conditional distribution of y given T(y) does not depend on n.
Rao-Blackwell Theorem: given an estimator n̂(y) of n that is not a function of the sufficient statistic, we can get an estimator with at most the same MSE that depends only on T(y): E[n̂(y) | T(y)].
• E[n̂(y) | T(y)] does not depend on n (critical: otherwise it would not be an estimator).
• The process is called the Rao-Blackwellization of n̂(y).

Rao-Blackwell Theorem (illustration)
[Figure, shown over several slides: the density f(y_1, y_2; n) is supported on points such as (1,3), (2,2), (4,0), (1,2), (3,1), (2,1), (3,2), (3,0), (1,4). The sufficient statistic T(y_1, y_2) = y_1 + y_2 groups the points by their sum. An estimator n̂(y_1, y_2) assigns a value to each point, e.g. 3 at (1,3), 2 at (2,2), 0 at (4,0). Rao-Blackwellization replaces the value at each point by the conditional expectation n̂' = E[n̂(y_1, y_2) | y_1 + y_2] over its group: 1 on the points with sum 3, 1.5 on the points with sum 4, 3 on the points with sum 5.]
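A toy numeric check of Rao-Blackwellization that mirrors the points and estimator values in the illustration above. The equal-probability assumption and the "true" parameter value 2 are mine, purely to make the example runnable; the point is only that averaging an estimator within each level of T(y) = y_1 + y_2 keeps the mean and does not increase the MSE.

```python
from collections import defaultdict

# Outcomes (y1, y2), assumed equally likely, the value an arbitrary estimator
# assigns to each outcome, and the parameter value we pretend is the truth.
estimate = {(1, 3): 3, (2, 2): 2, (4, 0): 0, (3, 1): 1, (1, 2): 2,
            (2, 1): 1, (3, 0): 0, (3, 2): 2, (1, 4): 4}
true_n = 2

# Rao-Blackwellize: replace each value by the average over its T(y) = y1 + y2 level.
groups = defaultdict(list)
for y, est in estimate.items():
    groups[sum(y)].append(est)
rb = {y: sum(groups[sum(y)]) / len(groups[sum(y)]) for y in estimate}

def mse(est):
    return sum((v - true_n) ** 2 for v in est.values()) / len(est)

print("MSE before:", round(mse(estimate), 3), " MSE after:", round(mse(rb), 3))
# Per-level averages: 1.0 for sum 3, 1.5 for sum 4, 3.0 for sum 5 (as in the figure).
```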
Rao-Blackwell Theorem
n̂' = E[n̂(y) | T(y)]
• Law of total expectation: E[n̂'] = E[n̂], so the expectation (bias) remains the same.
• The MSE (mean square error) can only decrease: MSE[n̂'] ≤ MSE[n̂].

Why does the MSE decrease?
• Suppose we have two points with equal probabilities, and an estimator of n that gives the estimates a and b on these points.
• We replace it by an estimator that instead returns the average (a+b)/2 on both points.
• The (scaled) contribution of these two points to the square error changes from (a-n)² + (b-n)² to 2((a+b)/2 - n)².

Why does the MSE decrease?
Show that (a-n)² + (b-n)² ≥ 2((a+b)/2 - n)².
(Hint: the difference between the two sides is (a-b)²/2 ≥ 0.)

Sufficient Statistic for estimating n from k-mins sketches
Given k independent samples from Exp(n), estimate n.
• f(y; n) = Π_{i=1}^{k} n e^{-n y_i} = n^k e^{-n Σ_{i=1}^{k} y_i}.
• The sum Σ_{i=1}^{k} y_i is a sufficient statistic for estimating any function of n (including n, 1/n, n²).
• Rao-Blackwell ⟹ we cannot gain by using estimators with a different dependence on {y_i} (e.g. functions of the median or of a partial sum).

Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE
MLE estimate n̂_MLE = k / Σ_{i=1}^{k} y_i.
• x = Σ_{i=1}^{k} y_i, the sum of k i.i.d. Exp(n) random variables, has the PDF
  f_{k,n}(x) = n e^{-nx} (nx)^{k-1} / (k-1)!.
• The expectation of the MLE estimate is ∫_0^∞ (k/x) f_{k,n}(x) dx = (k/(k-1)) n.

Estimating Distinct Count from a Min-Hash Sketch: k-mins Unbiased Estimator
n̂ = (k-1) / Σ_{i=1}^{k} y_i (for k > 1).
• The variance of the unbiased estimate is ∫_0^∞ ((k-1)/x)² f_{k,n}(x) dx - n² = n²/(k-2).
• The CV is σ/n = 1/√(k-2).
Is this the best we can do? Are we using the information in the sketch in the best possible way?

Cramér-Rao lower bound (CRLB)
An information-theoretic lower bound on the variance of any unbiased estimator n̂ of n.
• Likelihood function: f(y; n). Log likelihood: ℓ(y; n) = ln f(y; n).
• Fisher information: I(n) = -E[∂² ℓ(y; n) / ∂n²].
• CRLB: any unbiased estimator has V[n̂] ≥ 1/I(n).

CRLB for estimating n
• Likelihood function for y = {y_i}: f(y; n) = Π_{i=1}^{k} n e^{-n y_i} = n^k e^{-n Σ_{i=1}^{k} y_i}.
• Log likelihood: ℓ(y; n) = k ln n - n Σ_{i=1}^{k} y_i.
• Negated second derivative: -∂² ℓ(y; n)/∂n² = k/n².
• Fisher information: I(n) = -E[∂² ℓ(y; n)/∂n²] = k/n².
• CRLB: Var[n̂] ≥ 1/I(n) = n²/k.

Estimating Distinct Count from a Min-Hash Sketch: k-mins Unbiased Estimator
n̂ = (k-1) / Σ_{i=1}^{k} y_i (for k > 1).
Our estimator has CV 1/√(k-2).
The Cramér-Rao lower bound on the CV is 1/√k.
⟹ We are using the information in the sketch nearly optimally!
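A short simulation (same idealized setup as before, with arbitrary parameter values) comparing the biased MLE k/Σ y_i with the unbiased estimator (k-1)/Σ y_i, and the empirical CV of the latter with 1/√(k-2) and the CRLB 1/√k:

```python
import random, math

rng = random.Random(7)
n, k, trials = 10_000, 64, 5_000

mle, unbiased = [], []
for _ in range(trials):
    s = sum(rng.expovariate(n) for _ in range(k))   # sum of k Exp(n) min-hash values
    mle.append(k / s)
    unbiased.append((k - 1) / s)

mean = sum(unbiased) / trials
cv = math.sqrt(sum((u - mean) ** 2 for u in unbiased) / trials) / n
print(f"E[MLE] ~= {sum(mle) / trials:.0f}  (theory: n*k/(k-1) = {n * k / (k - 1):.0f})")
print(f"E[unbiased] ~= {mean:.0f}  (theory: {n})")
print(f"CV ~= {cv:.4f}  vs 1/sqrt(k-2) = {1 / math.sqrt(k - 2):.4f}, "
      f"CRLB 1/sqrt(k) = {1 / math.sqrt(k):.4f}")
```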
Estimating Distinct Count from a Min-Hash Sketch: Bottom-k
Bottom-k sketch y_1 < y_2 < ... < y_k. Can we specify the distribution? Use the exponential distribution:
• For k = 1, it is the same as k-mins: y_1 ∼ Exp(n).
• The minimum y_2 of the remaining n-1 elements is Exp(n-1) conditioned on y_2 > y_1. Since the exponential is memoryless, y_2 - y_1 ∼ Exp(n-1).
• More generally, y_{i+1} - y_i ∼ Exp(n-i).
What is the relation with k-mins sketches?

Bottom-k versus k-mins sketches
Bottom-k sketch: the spacings are samples from Exp(n), Exp(n-1), ..., Exp(n-k+1).
k-mins sketch: k samples from Exp(n).
To obtain x ∼ Exp(n) from z ∼ Exp(n-i) (without knowing n) we can take min{z, t}, where t ∼ Exp(i).
⟹ We can use k-mins estimators with bottom-k sketches. We can do even better by taking the expectation over the choices of t.
Bottom-k sketches carry strictly more information than k-mins sketches!

Estimating Distinct Count from a Min-Hash Sketch: Bottom-k
Likelihood function of y_1, ..., y_k (with y_0 ≡ 0):
f(y; n) = Π_{i=1}^{k} (n+1-i) e^{-(n+1-i)(y_i - y_{i-1})}
        = (n!/(n-k)!) e^{-Σ_{i=1}^{k} (n+1-i)(y_i - y_{i-1})}
        = e^{k y_k - Σ_{i=1}^{k-1} y_i} · (n!/(n-k)!) e^{-(n+1) y_k}.
The first factor does not depend on n; the second factor depends on n.
What does estimation theory tell us?

Estimating Distinct Count from a Min-Hash Sketch: Bottom-k
What does estimation theory tell us?
Likelihood function: f(y; n) = e^{k y_k - Σ_{i=1}^{k-1} y_i} · (n!/(n-k)!) e^{-(n+1) y_k}.
y_k (the maximum value in the sketch) is a sufficient statistic for estimating n (or any function of n).
It captures everything we can glean from the bottom-k sketch about n.

Bottom-k: MLE for Distinct Count
The likelihood function (probability density) is
f(y; n) = e^{k y_k - Σ_{i=1}^{k-1} y_i} · (n!/(n-k)!) e^{-(n+1) y_k}.
Find the value of n which maximizes f(y; n):
• Look only at the part that depends on n.
• Take the logarithm (same maximizer):
  ℓ(y; n) = Σ_{i=0}^{k-1} ln(n-i) - (n+1) y_k.

Bottom-k: MLE for Distinct Count
We look for the n which maximizes
ℓ(y; n) = Σ_{i=0}^{k-1} ln(n-i) - (n+1) y_k.
∂ℓ(y; n)/∂n = Σ_{i=0}^{k-1} 1/(n-i) - y_k.
The MLE is the solution of Σ_{i=0}^{k-1} 1/(n-i) = y_k.
It needs to be solved numerically.

Summary: k-mins count estimators
• k-mins sketch with the U[0,1] distribution: y'_1, ..., y'_k.
• With the Exp distribution: y_1, ..., y_k, where y_i = -ln(1 - y'_i).
• Sufficient statistic for (any function of) n: Σ_{i=1}^{k} y_i.
• MLE / unbiased estimator for 1/n: (1/k) Σ_{i=1}^{k} y_i; CV: 1/√k; CRLB: 1/√k.
• MLE for n: k / Σ_{i=1}^{k} y_i.
• Unbiased estimator for n: (k-1) / Σ_{i=1}^{k} y_i; CV: 1/√(k-2); CRLB: 1/√k.

Summary: bottom-k count estimators
• Bottom-k sketch with U[0,1]: y'_1 < ... < y'_k.
• With the Exp distribution: y_1 < ... < y_k, where y_i = -ln(1 - y'_i).
• Sufficient statistic for (any function of) n: y_k.
• Contains strictly more information than k-mins.
• When n ≫ k, approximately the same as k-mins.
• The MLE for n is the solution of Σ_{i=0}^{k-1} 1/(n-i) = y_k.

Bibliography
• See lecture 3, where we will continue with Min-Hash sketches:
  - use as random samples,
  - applications to similarity,
  - inverse-probability based distinct count estimators.
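As a closing illustration of the bottom-k MLE in the summary above, here is a minimal sketch of solving Σ_{i=0}^{k-1} 1/(n-i) = y_k numerically. The bisection routine and the simulated-sketch example are my own choices; the lecture only states that the equation is solved numerically.

```python
import random

def bottomk_mle(y_k, k, hi=1e12):
    """Solve sum_{i=0}^{k-1} 1/(n - i) = y_k for n > k - 1 by bisection.
    The left-hand side is decreasing in n, so the root is unique."""
    f = lambda n: sum(1.0 / (n - i) for i in range(k)) - y_k
    lo = (k - 1) + 1e-9               # LHS tends to +inf as n approaches k-1 from above
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid                  # root lies above mid
        else:
            hi = mid
    return (lo + hi) / 2

# Example: simulate a bottom-k sketch of n distinct elements with Exp(1) hashes;
# y_k is the k-th smallest hash value.
rng = random.Random(3)
n, k = 50_000, 200
y = sorted(rng.expovariate(1.0) for _ in range(n))[:k]
print("true n =", n, " bottom-k MLE estimate ~=", round(bottomk_mle(y[-1], k)))
```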