Leveraging Big Data: Lecture 2
http://www.cohenwang.com/edith/bigdataclass2013
Instructors:
Edith Cohen
Amos Fiat
Haim Kaplan
Tova Milo
Counting Distinct Elements
Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
• Elements occur multiple times; we want to count the number of distinct elements.
• The number of distinct elements is n (= 6 in the example).
• The total number of elements is 11 in this example.
Exact counting of n distinct elements requires a structure of size Ω(n)!
We are happy with an approximate count that uses a small-size working memory.
Distinct Elements: Approximate Counting
Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
We want to be able to compute and maintain a small sketch s(N) of the set N of distinct items seen so far: N = {32, 12, 14, 7, 6, 4}
Distinct Elements: Approximate Counting
• Size of the sketch: |s(N)| ≪ |N| = n
• We can query s(N) to get a good estimate n̂(s) of n (small relative error)
• For a new element x, it is easy to compute s(N ∪ {x}) from s(N) and x
  ⇒ suitable for data stream computation
• If N1 and N2 are (possibly overlapping) sets, then we can compute the union sketch s(N1 ∪ N2) from s(N1) and s(N2)
  ⇒ suitable for distributed computation
Distinct Elements: Approximate Counting
Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
Size-estimation / minimum-value technique [Flajolet-Martin 85, C 94]:
h(x) ∼ U[0,1]: h is a random hash function from element IDs to uniform random numbers in [0,1]
Maintain the min-hash value y:
• Initialize y ← 1
• Processing an element x: y ← min{y, h(x)}
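A minimal Python sketch of this minimum-value technique, using the stream from the slides; the SHA-256-based hash is an illustrative stand-in for the random hash function h, not part of the original material:

```python
import hashlib

def h(x):
    """Illustrative stand-in for a random hash from element IDs to [0, 1)."""
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

y = 1.0                                          # initialize y <- 1
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    y = min(y, h(x))                             # y <- min{y, h(x)}
print(y)                                         # repeated elements never change y
```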
Distinct Elements: Approximate Counting
π‘₯
32, 12, 14, 32, 7, 12, 32, 7, 6,
12, 4,
𝑛 0 1 2 3 3 4 4 4 4 5 5 6
β„Ž(π‘₯) 0.45 0.35 0.74 0.45 0.21 0.35 0.45 0.21 0.14 0.35 0.92
𝑦 1
0.45 0.35 0.35
0.35 0.21 0.21
0.21
0.21
0.14
0.14
0.14
The minimum hash value 𝑦 is:
Unaffected by repeated elements.
Is non-increasing with the number of distinct elements 𝑛.
Distinct Elements: Approximate Counting
How does the minimum hash value y give information on the number of distinct elements n?
[Figure: n random hash values in the interval [0,1]; the estimate is based on the minimum.]
The expectation of the minimum is E[min_x h(x)] = 1/(n+1).
A single value gives only limited information. To boost the information, we maintain k ≥ 1 values.
Why is the expectation 1/(n+1)?
• Take a circle of circumference 1.
• Throw a random red point to “mark” the start of the unit interval (circle points map to [0,1]).
• Throw another n points independently at random.
• The circle is cut into n + 1 segments by these points.
• The expected length of each segment is 1/(n+1).
• The same holds for the segment clockwise from the red point, whose length is distributed exactly like the minimum of the n hash values.
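A quick Monte Carlo check of this claim; the value n = 6 and the number of trials are arbitrary illustrative choices:

```python
import random

n, trials = 6, 200_000
avg_min = sum(min(random.random() for _ in range(n)) for _ in range(trials)) / trials
print(avg_min, 1 / (n + 1))   # the empirical mean of the minimum should be close to 1/(n+1)
```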
Min-Hash Sketches
These sketches maintain k values y1, ..., yk from the range of the hash function (distribution).

k-mins sketch: Use k “independent” hash functions h1, h2, ..., hk.
Track the respective minimum y1, y2, ..., yk of each function.

Bottom-k sketch: Use a single hash function h.
Track the k smallest values y1, ..., yk.

k-partition sketch: Use a single hash function h′.
Use the first log2(k) bits of h′(x) to map x uniformly to one of k parts; call the remaining bits h(x).
For i = 1, ..., k: track the minimum hash value yi of the elements in part i.

All three sketches are the same for k = 1.
Min-Hash Sketches
k-mins, bottom-k, k-partition
Why study all 3 variants? They offer different tradeoffs between update cost, accuracy, and usage.
Beyond distinct counting:
• Min-Hash sketches correspond to sampling schemes of large data sets
• Similarity queries between datasets
• Selectivity/subset queries
• These patterns generally apply as methods to gather increased confidence from a random “projection”/sample.
Min-Hash Sketches: Examples
k-mins, k-partition, bottom-k with k = 3
N = {32, 12, 14, 7, 6, 4}
The min-hash values and sketches depend only on
• the random hash function(s)
• the set N of distinct elements
and not on the order in which elements appear or on their multiplicity.
Min-Hash Sketches: Example
k-mins, k = 3

x:      32    12    14    7     6     4
h1(x):  0.45  0.35  0.74  0.21  0.14  0.92
h2(x):  0.19  0.51  0.07  0.70  0.55  0.20
h3(x):  0.10  0.71  0.93  0.50  0.89  0.18

(y1, y2, y3) = (0.14, 0.07, 0.10)
Min-Hash Sketches: k-mins
k-mins sketch: Use k “independent” hash functions h1, h2, ..., hk.
Track the respective minimum y1, y2, ..., yk of each function.
Processing a new element x:
For i = 1, ..., k: yi ← min{yi, hi(x)}
Example (x = 12): h1(x) = 0.35, h2(x) = 0.51, h3(x) = 0.71
Computation: O(k) per element, whether or not the sketch is actually updated.
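A possible k-mins implementation in Python, using k seeded SHA-256 hashes as a stand-in for the k “independent” hash functions; the class and hash construction are illustrative assumptions:

```python
import hashlib

def h(x, seed):
    """Seeded hash to [0, 1); stands in for the i-th "independent" hash function."""
    d = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class KMinsSketch:
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k                       # y_i <- 1

    def add(self, x):                            # O(k) per element
        for i in range(self.k):
            self.y[i] = min(self.y[i], h(x, i))  # y_i <- min{y_i, h_i(x)}

sketch = KMinsSketch(3)
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    sketch.add(x)
print(sketch.y)
```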
Min-Hash Sketches: Example
k-partition, k = 3

x:                  32    12    14    7     6     4
i(x) (part-hash):   2     3     1     1     2     3
h(x) (value-hash):  0.19  0.51  0.07  0.70  0.55  0.20

(y1, y2, y3) = (0.07, 0.19, 0.20)
Min-Hash Sketches: k-partition
k-partition sketch: Use a single hash function h′.
Use the first log2(k) bits of h′(x) to map x uniformly to one of k parts; call the remaining bits h(x).
For i = 1, ..., k: track the minimum hash value yi of the elements in part i.
Processing a new element x:
• i ← first log2(k) bits of h′(x)
• h ← remaining bits of h′(x)
• yi ← min{yi, h}
Example (x = 32): i(x) = 2, h(x) = 0.19, so y2 ← min{y2, 0.19}
Computation: O(1) to test or update.
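A possible k-partition implementation, splitting one SHA-256 digest into a part index and a value hash; using the digest modulo k instead of literally the first log2(k) bits is an implementation shortcut, and the class name is illustrative:

```python
import hashlib

class KPartitionSketch:
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k                     # one minimum per part

    def add(self, x):                          # O(1) per element
        d = hashlib.sha256(str(x).encode()).digest()
        part = int.from_bytes(d[:8], "big") % self.k          # "part" bits i(x)
        value = int.from_bytes(d[8:16], "big") / 2**64        # remaining bits h(x)
        self.y[part] = min(self.y[part], value)

sketch = KPartitionSketch(4)
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    sketch.add(x)
print(sketch.y)                                # empty parts keep the initial value 1.0
```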
Min-Hash Sketches: Example
Bottom-k, k = 3

x:     32    12    14    7     6     4
h(x):  0.19  0.51  0.07  0.70  0.55  0.20

(y1, y2, y3) = (0.07, 0.19, 0.20)
Min-Hash Sketches: bottom-k
Bottom-k sketch: Use a single hash function h.
Track the k smallest values y1 < y2 < ... < yk.
Processing a new element x:
If h(x) < yk: (y1, ..., yk) ← sort{y1, ..., y(k−1), h(x)}
Computation: the sketch (y1, ..., yk) is maintained as a sorted list or as a priority queue.
• O(1) to test whether an update is needed
• O(k) to update a sorted list; O(log k) to update a priority queue
We will see that #changes ≪ #distinct elements.
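A possible bottom-k implementation keeping the k smallest hash values in a max-heap, so the test is O(1) and an update is O(log k); storing the current hash values in a set to skip repeated elements is a design choice of this sketch, not prescribed by the slides:

```python
import hashlib, heapq

def h(x):
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class BottomKSketch:
    def __init__(self, k):
        self.k = k
        self.neg = []           # max-heap of the k smallest values, stored negated
        self.in_sketch = set()  # hash values currently in the sketch

    def add(self, x):
        v = h(x)
        if v in self.in_sketch:                 # repeated element: nothing to do
            return
        if len(self.neg) < self.k:
            heapq.heappush(self.neg, -v)
            self.in_sketch.add(v)
        elif v < -self.neg[0]:                  # O(1) test against the k-th smallest
            evicted = -heapq.heappushpop(self.neg, -v)   # O(log k) update
            self.in_sketch.discard(evicted)
            self.in_sketch.add(v)

    def values(self):
        return sorted(-v for v in self.neg)     # y_1 < ... < y_k

sketch = BottomKSketch(3)
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    sketch.add(x)
print(sketch.values())
```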
Min-Hash Sketches: Number of updates
Claim: The expected number of actual updates (changes) of the min-hash sketch is O(k ln n).

Proof: First consider k = 1. Look at the distinct elements in the order they first occur.
The i-th distinct element has a lower hash value than the current minimum with probability 1/i: this is the probability of being first in a random permutation of i elements.
⟹ The total expected number of updates is Σ_{i=1}^{n} 1/i = H_n ≤ 1 + ln n.

Stream:        32   12   14   32   7    12   32   7    6    12   4
Update prob.:  1    1/2  1/3  0    1/4  0    0    0    1/5  0    1/6
Min-Hash Sketches: Number of updates
Claim: The expected number of actual updates (changes) of the min-hash sketch is O(k ln n).

Proof (continued): Recap for k = 1 (a single min-hash value): the i-th distinct element causes an update with probability 1/i ⟹ the expected total is Σ_{i=1}^{n} 1/i ≤ 1 + ln n.
k-mins: k min-hash values (apply the k = 1 argument k times).
Bottom-k: We keep the k smallest values, so the update probability of the i-th distinct element is min{1, k/i} (the probability of being among the first k in a random permutation of i elements).
k-partition: k min-hash values, each over ≈ n/k distinct elements.
Merging Min-Hash Sketches
!! We apply the same hash functions to all elements / data sets / streams.
The union sketch y from the sketches y′, y′′ of two sets:
• k-mins: take the minimum per hash function: yi ← min{yi′, yi′′}
• k-partition: take the minimum per part: yi ← min{yi′, yi′′}
• Bottom-k: the k smallest values over the union of the data must be among the k smallest of their own set:
  {y1, ..., yk} = bottom-k{y1′, ..., yk′, y1′′, ..., yk′′}
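Possible merge helpers matching these rules, operating on plain lists of min-hash values; this is a sketch under the assumption that both inputs were built with the same hash functions, and the function names and example values are illustrative:

```python
import heapq

def merge_kmins(y1, y2):
    """k-mins (and k-partition) merge: coordinate-wise minimum."""
    return [min(a, b) for a, b in zip(y1, y2)]

def merge_bottomk(y1, y2, k):
    """Bottom-k merge: the k smallest distinct values in the union of the two sketches."""
    return heapq.nsmallest(k, set(y1) | set(y2))

print(merge_kmins([0.14, 0.07, 0.10], [0.20, 0.05, 0.33]))       # [0.14, 0.05, 0.10]
print(merge_bottomk([0.07, 0.19, 0.20], [0.05, 0.19, 0.51], 3))  # [0.05, 0.07, 0.19]
```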
Using Min-Hash Sketches
Recap:
• We defined Min-Hash sketches (3 types)
• Adding elements, merging Min-Hash sketches
• Some properties of these sketches
Next: we put Min-Hash sketches to work
• Estimating the distinct count from a Min-Hash sketch
• Tools from estimation theory
The Exponential Distribution Exp(𝑛)
−𝑛π‘₯
 PDF 𝑛e
−𝑛π‘₯
, π‘₯ ≥ 0 ; CDF 1 − e
; πœ‡=𝜎=
 Very useful properties:
οƒ˜Memorylessness:
∀𝑑, 𝑦 ≥ 0, Pr π‘₯ > 𝑦 + 𝑑 π‘₯ > 𝑦] = Pr π‘₯ > 𝑑
οƒ˜Min-to-Sum conversion:
min Exp 𝑛1 , … , Exp 𝑛𝑑 ∼ Exp(𝑛1 + β‹― + 𝑛𝑑 )
 Relation with uniform:
ln 1−𝑒
−
𝑛
−𝑛π‘₯
𝑒 ∼ π‘ˆ 0,1 ⇔
π‘₯ ∼ Exp(𝑛) ⇔ 1 − e
∼ Exp(𝑛)
∼ π‘ˆ 0,1
1
n
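A quick empirical check of the Min-to-Sum property; the rates and trial count are arbitrary illustrative choices:

```python
import random

rates = [2.0, 3.0, 5.0]       # n_1, ..., n_t
trials = 200_000
mean_min = sum(min(random.expovariate(r) for r in rates) for _ in range(trials)) / trials
print(mean_min, 1 / sum(rates))   # both close to 0.1: min ~ Exp(n_1 + ... + n_t)
```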
Estimating Distinct Count from a Min-Hash Sketch: k-mins
• Change to the exponential distribution: h(x) ∼ Exp(1)
• By the Min-to-Sum property, yi ∼ Exp(n)
  – In fact, we can just work with h(x) ∼ U[0,1] and use yi ← −ln(1 − yi) when estimating.
• Estimating the number of distinct elements becomes a parameter estimation problem:
  Given k independent samples from Exp(n), estimate n.
Estimating Distinct Count from a Min-Hash Sketch: k-mins
• Each yi ∼ Exp(n) has expectation 1/n and variance σ² = 1/n².
• The average a = (Σ_{i=1}^{k} yi)/k has expectation μ = 1/n and variance σ² = 1/(k n²). The CV is σ/μ = 1/√k.
• a is a good unbiased estimator for 1/n.
• But 1/n is the inverse of what we want. What about estimating n?
Estimating Distinct Count from a Min-Hash Sketch: k-mins
What about estimating n?
1) We can use the biased estimator 1/a = k / Σ_{i=1}^{k} yi.
   To say something useful about the estimate quality, we apply Chebyshev’s inequality to bound the probability that a is far from its expectation 1/n, and thus that 1/a is far from n.
2) Maximum likelihood estimation (a general and powerful technique).
Chebyshev’s Inequality
For any random variable x with expectation μ and standard deviation σ, and for any c ≥ 1:
Pr[|x − μ| ≥ cσ] ≤ 1/c²

For a we have μ = 1/n and σ = 1/(n√k).
For ε < 1/2:
Pr[|1/a − n| ≥ εn] ≤ Pr[|a − 1/n| ≥ ε/(2n)] ≤ 4/(ε²k)
(using c = ε√k/2).
Using Chebyshev’s Inequality
For 0 < πœ– <
Pr
1
π‘Ž
1
,
2
πœ–
2
1− >
1
1+πœ–
;
1
1−πœ–
>1+πœ– >1+
1
π‘Ž
πœ–
2
− 𝑛 ≥ πœ– 𝑛 = 1 − Pr −πœ–π‘› ≤ − 𝑛 ≤ πœ– 𝑛 =
1
1 − Pr 𝑛(1 − πœ–) ≤ ≤ (1 + πœ–) 𝑛 =
π‘Ž
1 1
1 1
1 − Pr
≥π‘Ž≥
≤
𝑛 1−πœ–
1+πœ–π‘›
1
πœ–
πœ– 1
1 − Pr
1+
≥ π‘Ž ≥ (1 − )
𝑛
2
2 𝑛
1
πœ–
= Pr π‘Ž − ≥
𝑛
2𝑛
Maximum Likelihood Estimation
Given a set of independent yi ∼ Fi(θ), where we do not know θ:
The MLE θ̂_MLE is the value of θ that maximizes the likelihood (joint density) function f(y; θ), i.e., the value of θ that maximizes the probability (density) of observing {yi}.
Properties:
• A principled way of deriving estimators
• Converges in probability to the true value (with enough i.i.d. samples) ... but is generally biased
• (Asymptotically!) optimal: minimizes the MSE (mean square error) and meets the Cramér-Rao lower bound
Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE
Given k independent samples from Exp(n), estimate n.
• Likelihood function of y (joint density): f(y; n) = Π_{i=1}^{k} n e^{−n yi} = n^k e^{−n Σ_{i=1}^{k} yi}
• Take the logarithm (does not change the maximizer): ℓ(y1, ..., yk; n) = ln f(y; n) = k ln n − n Σ_{i=1}^{k} yi
• Differentiate to find the maximum: ∂ℓ(y; n)/∂n = k/n − Σ_{i=1}^{k} yi = 0
• MLE estimate: n̂_MLE = k / Σ_{i=1}^{k} yi
We get the same estimator; it depends only on the sum!
Given π‘˜ independent samples from Exp(𝑛) ,
estimate 𝑛
We can think of several ways to combine and
use these π‘˜ samples and decrease the variance:
• average (sum)
• median
• remove outliers and average remaining, …
We want to get the most value (best estimate) from
the information we have (the sketch).
What combinations should we consider ?
Sufficient Statistic
A function T(y) = T(y1, ..., yk) is a sufficient statistic for estimating some function of the parameter θ if the likelihood function has the factored form f(y; θ) = c(y) g(T(y); θ).

The likelihood function (joint density) of k i.i.d. exponential random variables from Exp(n) is
f(y; n) = Π_{i=1}^{k} n e^{−n yi} = n^k e^{−n Σ_{i=1}^{k} yi}
⇒ The sum Σ_{i=1}^{k} yi is a sufficient statistic for n.
Sufficient Statistic
A function T(y) = T(y1, ..., yk) is a sufficient statistic for estimating some function of the parameter θ if the likelihood function has the factored form f(y; θ) = c(y) g(T(y); θ).
In particular, the MLE depends on y only through T(y):
• the maximum with respect to θ does not depend on c(y);
• the maximum of g(T(y); θ), computed by differentiating with respect to θ, is a function of T(y).
Sufficient Statistic
T(y) = T(y1, ..., yk) is a sufficient statistic for θ if the likelihood function has the form f(y; θ) = c(y) g(T(y); θ).
Lemma: T(y) is sufficient ⟺ the conditional distribution of y given T(y) does not depend on θ.
If we fix T(y), the density function is f(y; θ) ∝ c(y).
If we know the density up to a fixed factor, it is determined completely by normalizing to 1.
Rao-Blackwell Theorem
Recap: T(y) is a sufficient statistic for θ ⟺ the conditional distribution of y given T(y) does not depend on θ.
Rao-Blackwell Theorem: Given an estimator b(y) of θ that is not a function of the sufficient statistic, we can get an estimator with at most the same MSE that depends only on T(y): E[b(y) | T(y)].
• E[b(y) | T(y)] does not depend on θ (critical)
• The process is called Rao-Blackwellization of b(y)
Rao-Blackwell Theorem
[Illustration over several slides: a density f(y1, y2; θ) on the points (1,3), (2,2), (4,0), (1,2), (3,1), (2,1), (3,2), (3,0), (1,4). The sufficient statistic T(y1, y2) = y1 + y2 groups the points by their sum, and the conditional density f(y1, y2; θ | y1 + y2) does not depend on θ.]
Rao-Blackwell Theorem
[Illustration continued: an estimator θ(y1, y2) takes the values θ(1,3) = 3, θ(2,2) = 2, θ(4,0) = 0, θ(3,1) = 1, θ(1,2) = 2, θ(2,1) = 1, θ(3,0) = 0, θ(3,2) = 2, θ(1,4) = 4.
Rao-Blackwell: θ′ = E[θ(y1, y2) | y1 + y2] averages θ over the points with the same sum: in the illustration the points with y1 + y2 = 4 all get 1.5, the points with y1 + y2 = 3 get 1, and the points with y1 + y2 = 5 get 3.]
Rao-Blackwell Theorem
𝜽(π’šπŸ , π’šπŸ ) T 𝑦1 , 𝑦2 = y1 + y2
Rao-Blackwell: 𝜽′ = 𝑬[𝜽 π’šπŸ , π’šπŸ |π’šπŸ + π’šπŸ ]
 Law of total expectation:
𝐄[𝜽′] = 𝑬[𝜽]
Expectation (bias) remains the same
 MSE (Mean Square Error) can only decrease
′
MSE 𝜽 ≤ MSE[𝜽]
Why does the MSE decrease?
• Suppose we have two points with equal probabilities, and an estimator of μ that gives estimates a and b on these points.
• We replace it by an estimator that instead returns the average (a + b)/2 on both points.
• The (scaled) contribution of these two points to the square error changes from (a − μ)² + (b − μ)² to 2((a + b)/2 − μ)².
Why does the MSE decrease?
Show that (a − μ)² + (b − μ)² ≥ 2((a + b)/2 − μ)².
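One way to see this: (a − μ)² + (b − μ)² − 2((a + b)/2 − μ)² = (a² + b²)/2 − ab = (a − b)²/2 ≥ 0, with equality exactly when a = b.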
Sufficient Statistic for estimating n from k-mins sketches
Given k independent samples from Exp(n), estimate n.
• f(y; n) = Π_{i=1}^{k} n e^{−n yi} = n^k e^{−n Σ_{i=1}^{k} yi}
• The sum Σ_{i=1}^{k} yi is a sufficient statistic for estimating any function of n (including n, 1/n, n²).
• Rao-Blackwell ⇒ we cannot gain by using estimators with a different dependence on {yi} (e.g., functions of the median or of a smaller sum).
Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE
MLE estimate: n̂_MLE = k / Σ_{i=1}^{k} yi
• x = Σ_{i=1}^{k} yi, the sum of k i.i.d. ∼ Exp(n) random variables, has PDF
  f_{k,n}(x) = n e^{−nx} (nx)^{k−1} / (k−1)!
• The expectation of the MLE estimate is
  ∫_0^∞ (k/x) f_{k,n}(x) dx = (k/(k−1)) n
Estimating Distinct Count from a Min-Hash Sketch: k-mins
Unbiased estimator: n̂ = (k−1) / Σ_{i=1}^{k} yi  (for k > 1)
The variance of the unbiased estimate is
σ² = ∫_0^∞ ((k−1)/x)² f_{k,n}(x) dx − n² = n²/(k−2)
The CV is σ/μ = 1/√(k−2).
Is this the best we can do?
Cramér-Rao lower bound (CRLB)
Are we using the information in the sketch in the best possible way?
Cramér-Rao lower bound (CRLB)
An information-theoretic lower bound on the variance of any unbiased estimator θ̂ of θ.
Likelihood function: f(y; θ)
Log likelihood: ℓ(y; θ) = ln f(y; θ)
Fisher information: I(θ) = −E[∂²ℓ(y; θ)/∂θ²]
CRLB: any unbiased estimator has V(θ̂) ≥ 1/I(θ)
CRLB for estimating n
• Likelihood function for n, with y = (yi): f(y; n) = Π_{i=1}^{k} n e^{−n yi} = n^k e^{−n Σ_{i=1}^{k} yi}
• Log likelihood: ℓ(y; n) = k ln n − n Σ_{i=1}^{k} yi
• Negated second derivative: −∂²ℓ(y; n)/∂n² = k/n²
• Fisher information: I(n) = −E[∂²ℓ(y; n)/∂n²] = k/n²
• CRLB: var(n̂) ≥ 1/I(n) = n²/k
Estimating Distinct Count from a Min-Hash Sketch: k-mins
Unbiased estimator: n̂ = (k−1) / Σ_{i=1}^{k} yi  (for k > 1)
Our estimator has CV 1/√(k−2).
The Cramér-Rao lower bound on the CV is 1/√k.
⇒ We are using the information in the sketch nearly optimally!
Estimating Distinct Count from a Min-Hash Sketch: Bottom-k
Bottom-k sketch: y1 < y2 < ... < yk.
Can we describe the distribution? Use the exponential distribution:
• k = 1 is the same as k-mins: y1 ∼ Exp(n).
• The minimum y2 of the remaining n − 1 elements is distributed as Exp(n − 1) conditioned on y2 > y1. By memorylessness, y2 − y1 ∼ Exp(n − 1).
• More generally, y(i+1) − yi ∼ Exp(n − i).
What is the relation with k-mins sketches?
Bottom-k versus k-mins sketches
Bottom-k sketch: samples from Exp(n), Exp(n−1), ..., Exp(n−k+1).
k-mins sketch: k samples from Exp(n).
To obtain x ∼ Exp(n) from z ∼ Exp(n − i) (without knowing n) we can take x = min{z, t} where t ∼ Exp(i).
So we can use k-mins estimators with bottom-k sketches, and we can do even better by taking the expectation over the choices of t.
Bottom-k sketches carry strictly more information than k-mins sketches!
Estimating Distinct Count from a Min-Hash Sketch: Bottom-k
Likelihood function of y1, ..., yk given n (with y0 = 0):
f(y; n) = Π_{i=1}^{k} (n + 1 − i) e^{−(n+1−i)(yi − y(i−1))}
       = [n!/(n−k)!] e^{−(n+1) Σ_{i=1}^{k} (yi − y(i−1))} e^{Σ_{i=1}^{k} i (yi − y(i−1))}
       = e^{k·yk − Σ_{i=1}^{k−1} yi} · [n!/(n−k)!] e^{−(n+1) yk}
The first factor does not depend on n; the second factor depends on n.
What does estimation theory tell us?
Estimating Distinct Count from a Min-Hash Sketch: Bottom-k
What does estimation theory tell us?
Likelihood function:
f(y; n) = e^{k·yk − Σ_{i=1}^{k−1} yi} · [n!/(n−k)!] e^{−(n+1) yk}
yk (the maximum value in the sketch) is a sufficient statistic for estimating n (or any function of n).
It captures everything we can glean from the bottom-k sketch about n.
Bottom-k: MLE for Distinct Count
The likelihood function (probability density) is
f(y; n) = e^{k·yk − Σ_{i=1}^{k−1} yi} · [n!/(n−k)!] e^{−(n+1) yk}
Find the value of n that maximizes f(y; n):
• Look only at the part that depends on n.
• Take the logarithm (same maximizer):
ℓ(y; n) = Σ_{i=0}^{k−1} ln(n − i) − (n + 1) yk
Bottom-k: MLE for Distinct Count
We look for the n that maximizes
ℓ(y; n) = Σ_{i=0}^{k−1} ln(n − i) − (n + 1) yk
∂ℓ(y; n)/∂n = Σ_{i=0}^{k−1} 1/(n − i) − yk
The MLE is the solution of Σ_{i=0}^{k−1} 1/(n − i) = yk, which we need to solve numerically.
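A possible numerical solver for this equation by bisection, treating n as a continuous parameter; the function name and tolerance are illustrative, and y_k is on the exponential scale (i.e., after the −ln(1 − y′) transform):

```python
def bottom_k_mle(y_k, k, tol=1e-9):
    """Solve sum_{i=0}^{k-1} 1/(n - i) = y_k for n > k - 1 by bisection."""
    f = lambda n: sum(1.0 / (n - i) for i in range(k)) - y_k   # decreasing in n
    lo = (k - 1) + 1e-12            # f -> +infinity as n -> k-1 from above
    hi = float(2 * k)
    while f(hi) > 0:                # grow hi until the root is bracketed
        hi *= 2
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid                # root is above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative call: with k = 3 and a small y_k the MLE is roughly 3 / y_k.
print(bottom_k_mle(0.0005, 3))
```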
Summary: k-mins count estimators
• k-mins sketch with U[0,1] distribution: y1′, ..., yk′
• With the Exp distribution: y1, ..., yk where yi = −ln(1 − yi′)
• Sufficient statistic for (any function of) n: Σ_{i=1}^{k} yi
• MLE / unbiased estimator for 1/n: (Σ_{i=1}^{k} yi)/k; CV: 1/√k; CRLB: 1/√k
• MLE for n: k / Σ_{i=1}^{k} yi
• Unbiased estimator for n: (k−1) / Σ_{i=1}^{k} yi; CV: 1/√(k−2); CRLB: 1/√k
Summary: bottom-k count estimators
• Bottom-k sketch with U[0,1] distribution: y1′ < ... < yk′
• With the Exp distribution: y1 < ... < yk where yi = −ln(1 − yi′)
• Sufficient statistic for (any function of) n: yk
• Contains strictly more information than k-mins
• When n ≫ k, approximately the same as k-mins
• The MLE for n is the solution of Σ_{i=0}^{k−1} 1/(n − i) = yk
See lecture 3: we will continue with Min-Hash sketches
• Use as random samples
• Applications to similarity
• Inverse-probability based distinct count estimators