
Leveraging Big Data: Lecture 3
http://www.cohenwang.com/edith/bigdataclass2013
Instructors:
Edith Cohen
Amos Fiat
Haim Kaplan
Tova Milo
Overview: More on Min-Hash Sketches
 Subset/Selection size queries from random
samples
 Min-hash sketches as samples
Other uses of the sampling “view” of sketches:
 Sketch-based similarity estimation
 Inverse-probability distinct count estimators
 Min-hash sketches on a small range (fewer bits)
How samples are useful
We often want to know more than just the number of distinct elements:
 How many distinct search queries (or distinct query/location pairs)…
  involve the recent election?
  are related to the flu?
  reflect financial uncertainty?
 How many distinct IP flows going through our network…
  use a particular protocol?
  originate from a particular location?
Such subset queries are specified by a predicate. They can be answered approximately from the sample.
Min-hash Sketches as Random Samples
A min-hash sketch as a random sample:
A distinct element x is sampled if it "contributed" to the sketch:  s(N ∖ {x}) ≠ s(N)
 To facilitate subset queries, we need to retain
meta-data/IDs of sampled elements.
 Min-hash samples can be efficiently computed
 over data streams
 over distributed data (using mergeability)
K-mins sketch as a sample
k-mins, k = 3

x     :  32    12    14    7     6     4
h1(x) :  0.45  0.35  0.74  0.21  0.14  0.92
h2(x) :  0.19  0.51  0.07  0.70  0.55  0.20
h3(x) :  0.10  0.71  0.93  0.50  0.89  0.18

(y1, y2, y3) = (0.14, 0.07, 0.10)
k-mins sample: (6, 14, 32)
Sampling scheme: k times with replacement
k-partition sketch as a sample
k-partition, k = 3

x                  :  32    12    14    7     6     4
i(x)  (part-hash)  :  2     3     1     1     2     3
h(x)  (value-hash) :  0.19  0.51  0.07  0.70  0.55  0.20

(y1, y2, y3) = (0.07, 0.19, 0.20)
k-partition sample: (14, 32, 4)
Sampling scheme: throw elements into k buckets;
choose one uniformly from each nonempty bucket
Bottom-k sketch as a sample
Bottom-k, k = 3

x     :  32    12    14    7     6     4
h(x)  :  0.19  0.51  0.07  0.70  0.55  0.20

(y1, y2, y3) = {0.07, 0.19, 0.20}
Bottom-k sample: {14, 32, 4}
Sampling scheme: choose k without replacement
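
The three sampling schemes above can be reproduced in a few lines of Python. This is an illustrative sketch only: the hash values are hard-coded from the tables, and the helper names (kmins_sample, kpartition_sample, bottomk_sample) are ours, not part of any library.

# Illustrative: the three min-hash sampling schemes on the toy example above.
h1 = {32: 0.45, 12: 0.35, 14: 0.74, 7: 0.21, 6: 0.14, 4: 0.92}
h2 = {32: 0.19, 12: 0.51, 14: 0.07, 7: 0.70, 6: 0.55, 4: 0.20}
h3 = {32: 0.10, 12: 0.71, 14: 0.93, 7: 0.50, 6: 0.89, 4: 0.18}
part = {32: 2, 12: 3, 14: 1, 7: 1, 6: 2, 4: 3}   # part-hash i(x), k = 3

def kmins_sample(N, hashes):
    # one independent hash per coordinate; keep the element minimizing each
    return tuple(min(N, key=h.get) for h in hashes)

def kpartition_sample(N, part, h):
    # keep the element with the smallest value-hash in each nonempty bucket
    best = {}
    for x in N:
        i = part[x]
        if i not in best or h[x] < h[best[i]]:
            best[i] = x
    return tuple(best[i] for i in sorted(best))

def bottomk_sample(N, h, k):
    # the k elements with the smallest hash values
    return set(sorted(N, key=h.get)[:k])

N = [32, 12, 14, 7, 6, 4]
print(kmins_sample(N, [h1, h2, h3]))    # (6, 14, 32)
print(kpartition_sample(N, part, h2))   # (14, 32, 4)
print(bottomk_sample(N, h2, 3))         # {14, 32, 4} (printed in arbitrary set order)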
Selection/Subset queries
from min-hash samples
 𝒌′ ≤ 𝒌 distinct elements sampled
 The sample is exchangeable (fixing the sample
size, all subsets are equally likely).
 When 𝐧 ≫ 𝒌 all three schemes are similar.
Let 𝑃 be the subset of elements satisfying our
selection predicate. We want to estimate
 The number |P ∩ N| of distinct elements satisfying the predicate, or
 their fraction α ≡ |P ∩ N| / |N|
Subset queries: k-mins samples
One uniform sample x ∈ N has probability α to be from P.
Its "presence" I[x ∈ P] is 1 with probability α and 0 with probability 1 − α.
The expectation and variance of I[x ∈ P] are:
μ = α·1 + (1 − α)·0 = α
σ² = α·1² + (1 − α)·0² − μ² = α(1 − α)
 Our estimator for a k-mins sample (x1, …, xk) (k times with replacement) is:
α̂ = (1/k) Σ_{i=1}^{k} I[x_i ∈ P]
 Expectation: μ = α
 Variance: σ² = α(1 − α)/k
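
As a minimal illustration of this estimator (our own helper; it assumes the element IDs of the sample were retained):

def kmins_subset_fraction(sample, predicate):
    # fraction of the k (with-replacement) sampled elements satisfying the predicate
    return sum(1 for x in sample if predicate(x)) / len(sample)

# e.g. the fraction of sampled element IDs larger than 10 in the k-mins sample (6, 14, 32):
print(kmins_subset_fraction((6, 14, 32), lambda x: x > 10))   # 2/3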
Subset queries:
bottom-k and k-partition samples
Sampling is without replacement:
 Exactly 𝒌′ = 𝒌 times with bottom-k
 𝟏 ≤ 𝒌′ ≤ 𝒌 times with k-partition (𝑘’ is the number
of nonempty “buckets” when tossing 𝑛 balls into 𝑘
buckets)
We use the estimator:  α̂ = |P ∩ S| / |S| = |P ∩ S| / k′
The expectation is:  E[α̂] = |P ∩ N| / |N| ≡ α
The variance (conditioned on k′) is:
σ² = (α(1 − α)/k′) · (1 − (k′ − 1)/(n − 1))
We show these next.
Expectation of 𝛼
(k-partition and bottom-k)
We condition on the number of sampled (distinct) elements k′ ≥ 1: consider the "positions" i = 1, …, k′ in the sample and their "contributions" T_i to α̂. We have α̂ = Σ_{i=1}^{k′} T_i.
If position i gets an element x ∈ P ∩ N (probability α), then T_i = 1/k′. Otherwise, T_i = 0. Therefore,
E[T_i] = α/k′
Var[T_i] = α·(1/k′)² − (α/k′)² = α(1 − α)/k′²
From linearity of expectation, E[α̂] = Σ_{i=1}^{k′} E[T_i] = α.
k-partition: since this is the expectation for every possible k′, it is also the expectation overall.
Variance of 𝛼
(k-partition and bottom-k)
Conditioned on k′ ≥ 1:  Var[α̂] = Σ_{i,j ∈ {1,…,k′}} Cov[T_i, T_j]
Cov[T_i, T_i] = Var[T_i] = α(1 − α)/k′²
For i ≠ j:
Cov[T_i, T_j] = E[T_i T_j] − α²/k′² = α·(αn − 1)/(n − 1)·(1/k′²) − α²/k′² = −α(1 − α) / ((n − 1) k′²)
Therefore
Var[α̂] = k′·α(1 − α)/k′² − k′(k′ − 1)·α(1 − α)/((n − 1) k′²)
        = (α(1 − α)/k′) · (1 − (k′ − 1)/(n − 1))
Subset estimation: Summary
For any predicate, we obtain an unbiased estimator α̂ of the fraction α = |P ∩ N| / |N| ∈ [0, 1], with standard deviation σ ≤ 1/(2√k).
 More accurate when α is close to 0 or to 1
 With bottom-k, more accurate when n = O(k)
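
A small simulation (ours, for illustration) that checks unbiasedness and the σ ≤ 1/(2√k) bound for the bottom-k subset estimator:

import random, statistics

def bottomk_subset_estimate(N, predicate, k, h):
    S = sorted(N, key=h)[:k]                        # bottom-k sample (here k' = k since n >= k)
    return sum(1 for x in S if predicate(x)) / len(S)

random.seed(1)
n, k = 1000, 64
pred = lambda x: x % 10 == 0                        # true fraction alpha = 0.1
estimates = []
for _ in range(2000):
    h = {x: random.random() for x in range(n)}      # fresh random hash per trial
    estimates.append(bottomk_subset_estimate(range(n), pred, k, h.get))
print(statistics.mean(estimates))                   # close to alpha = 0.1 (unbiased)
print(statistics.pstdev(estimates), 1 / (2 * k**0.5))   # empirical sigma vs. the 1/(2*sqrt(k)) bound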
Next:
Sketch-based similarity estimation
 Applications of similarity
 Modeling using features
 Scalability using sketches
 Terms and the shingling technique for text documents
 Jaccard and cosine similarity
 Sketch-based similarity estimators
Search example
User issues a query (over images, movies, text documents, Web pages).
Search engine finds many matching documents:
[Diagram: many matching documents: Doc 1, Doc 1′, Doc 1′′, Doc 2, Doc 2′, Doc 2′′, Doc 3, Doc 3′, Doc 3′′]
Elimination of near duplicates
A lot of redundant information – many documents
are very similar. Want to eliminate near-duplicates
[Diagram: groups of near-duplicates: {Doc 1, Doc 1′, Doc 1′′}, {Doc 2, Doc 2′, Doc 2′′}, {Doc 3, Doc 3′, Doc 3′′}]
Elimination of near duplicates
Return to the human user a concise, informative result:
[Diagram: one representative of each group (Doc 1, Doc 2′, Doc 3) returned to the user]
Identifying similar documents in a
collection of documents (text/images)
 Why is similarity interesting?
  Search (the query is also treated as a "document")
  Find text documents on a similar topic
  Face recognition
  Labeling documents (collection of images, only some are labeled; extend labels by similarity)
  …
Identifying near-duplicates
(very similar documents)
 Why do we want to find near-duplicates ?
 Plagiarism
 Copyright violations
 Clean up search results
 Why do we find many near-duplicates?
 Mirror pages
 Variations on the same source
 Exact match is easy: use hash/signature
Document Similarity
Modeling:
 Identify a set of features for our similarity
application.
 Similar documents should have similar features:
similarity is captured by the similarity of the
feature sets/vectors (use a similarity measure)
 Analyse each document to extract the set of
relevant features
Sketch-based similarity:
Making it scalable
 Sketch the set of features of each document
such that similar sets imply similar sketches
 Estimate similarity of two feature sets from the
similarity of the two sketches
[Diagram: Doc 1 → feature vector (0,0,1,0,1,1,0,…) → Sketch 1;  Doc 2 → feature vector (1,0,1,1,1,1,0,…) → Sketch 2]
Similarity of text documents
 What is a good set of features ?
Approach:
 Features = words (terms)
 View each document as a bag of words
 Similar documents have similar bags
 This works well (with TF/IDF weighting…) for detecting documents on a similar topic.
 It is not geared toward detecting near-duplicates.
Shingling technique for text documents
(Web pages) [Broder 97]
For a parameter 𝒕:
 Each feature corresponds to a 𝒕-gram (shingle): an
ordered set of 𝒕 “tokens” (words)
 Very similar documents have similar sets of
features (even if sentences are shifted, replicated)
All 3-shingles in the title "Shingling technique for text documents Web pages":
 Shingling technique for
 technique for text
 for text documents
 text documents Web
 documents Web pages
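
A minimal sketch of the shingling step (ours; whitespace tokenization only, whereas real systems also normalize case, punctuation, and markup):

def shingles(text, t=3):
    # the set of t-grams of consecutive word tokens
    tokens = text.split()
    return {tuple(tokens[i:i + t]) for i in range(len(tokens) - t + 1)}

title = "Shingling technique for text documents Web pages"
for s in shingles(title, t=3):
    print(" ".join(s))
# prints the five 3-shingles listed above (in arbitrary set order)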
Similarity measures
We measure similarity of two documents by the
similarity of their feature sets/vectors
Comment: will focus on sets/binary vectors today.
In general, we sometimes want to associate
“weights” with presence of features in a document
 Two popular measures are
 The Jaccard coefficient
 Cosine similarity
Jaccard Similarity
A common similarity measure of two sets
[Venn diagram: features N1 of document 1 and features N2 of document 2]
Ratio of the size of the intersection to the size of the union:
J(N1, N2) = |N1 ∩ N2| / |N1 ∪ N2|
In the diagram: J = 3/8 = 0.375
Comment: Weighted Jaccard
Similarity of weighted (nonnegative) vectors
Sum of min over sum of max:
J(V, U) = Σ_i min{V_i, U_i} / Σ_i max{V_i, U_i}
V   = (0.00, 0.23, 0.00, 0.00, 0.03, 0.00, 1.00, 0.13)
U   = (0.34, 0.21, 0.00, 0.03, 0.05, 0.00, 1.00, 0.00)
min = (0.00, 0.21, 0.00, 0.00, 0.03, 0.00, 1.00, 0.00)
max = (0.34, 0.23, 0.00, 0.03, 0.05, 0.00, 1.00, 0.13)
J(V, U) = 1.24 / 1.78
Cosine Similarity
Similarity measure between two vectors: the cosine of the angle θ between them.
C(U, V) = (V · U) / (||V||_2 ||U||_2)
Euclidean norm: ||V||_2 = √(Σ_i V_i²)
Cosine Similarity (binary)
View each set N′ ⊂ N as a vector v(N′) with an entry for each element of the domain N:
i ∈ N′ ⇔ v_i(N′) = 1 ;  i ∉ N′ ⇔ v_i(N′) = 0
Cosine similarity between N1 and N2:
C(N1, N2) = v(N1) · v(N2) / (||v(N1)||_2 ||v(N2)||_2) = |N1 ∩ N2| / √(|N1| |N2|)
In the Venn-diagram example: C = 3/√(5·6) ≈ 0.55
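
A minimal sketch computing the three measures. The concrete sets N1, N2 below are hypothetical, chosen only to be consistent with the Venn-diagram example (|N1| = 5, |N2| = 6, |N1 ∩ N2| = 3, |N1 ∪ N2| = 8); V and U are the vectors from the weighted-Jaccard slide.

from math import sqrt

def jaccard(N1, N2):
    return len(N1 & N2) / len(N1 | N2)

def weighted_jaccard(V, U):
    return sum(map(min, V, U)) / sum(map(max, V, U))

def cosine_binary(N1, N2):
    return len(N1 & N2) / sqrt(len(N1) * len(N2))

N1, N2 = {1, 2, 3, 4, 5}, {3, 4, 5, 6, 7, 8}
print(jaccard(N1, N2))          # 3/8 = 0.375
print(cosine_binary(N1, N2))    # 3/sqrt(30) ≈ 0.55

V = (0.00, 0.23, 0.00, 0.00, 0.03, 0.00, 1.00, 0.13)
U = (0.34, 0.21, 0.00, 0.03, 0.05, 0.00, 1.00, 0.00)
print(weighted_jaccard(V, U))   # 1.24 / 1.78 ≈ 0.70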
Estimating Similarity of sets using
their Min-Hash sketches
 We sketch all sets using the same hash functions.
There is a special relation between the sketches:
We say the sketches are “coordinated”
 Coordination is what allows the sketches to be
mergeable. If we had used different hash
functions for each set, the sketches would not have
been mergeable.
 Coordination also implies that similar sets have
similar sketches (LSH property). This allows us to
obtain good estimates of the similarity of two sets
from the similarity of sketches of the sets.
Jaccard Similarity from Min-Hash sketches
J(N1, N2) = |N1 ∩ N2| / |N1 ∪ N2|
For each Ni we have a Min-Hash sketch s(Ni) (using the same hash function(s) h for all sets).
 Merge s(N1) and s(N2) to obtain s(N1 ∪ N2)
 For each x ∈ s(N1 ∪ N2) we know everything about its membership in N1 or N2:
x ∈ s(N1 ∪ N2) is in Ni if and only if x ∈ s(Ni)
 In particular, we know whether x ∈ N1 ∩ N2
 J is the fraction of union members that are intersection members: apply the subset estimator to s(N1 ∪ N2)
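
A sketch of this procedure for bottom-k sketches (illustrative; the helper names are ours, and elements are identified by their hash values, which assumes h is injective on the data):

import random

def bottomk_sketch(N, h, k):
    # the k smallest hash values; with an injective h, a value identifies its element
    return sorted(h(x) for x in N)[:k]

def jaccard_estimate(s1, s2, k):
    # merge: the k smallest values in s1 ∪ s2 form the sketch of N1 ∪ N2
    s1set, s2set = set(s1), set(s2)
    union_sketch = sorted(s1set | s2set)[:k]
    # a union member is an intersection member iff its value appears in both sketches
    in_both = sum(1 for y in union_sketch if y in s1set and y in s2set)
    return in_both / len(union_sketch)

random.seed(0)
h = {x: random.random() for x in range(3000)}.get
N1, N2 = range(0, 2000), range(1000, 3000)    # true Jaccard = 1000/3000 = 1/3
k = 256
print(jaccard_estimate(bottomk_sketch(N1, h, k), bottomk_sketch(N2, h, k), k))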
k-mins sketches: Jaccard estimation
k = 4
s(N1) = (0.22, 0.11, 0.14, 0.22)
s(N2) = (0.18, 0.24, 0.14, 0.35)
s(N1 ∪ N2) = (0.18, 0.11, 0.14, 0.22)
Coordinates 2 and 4 come only from s(N1) (∈ N1 ∖ N2):  α̂ = 2/4 = 1/2
Coordinate 1 comes only from s(N2) (∈ N2 ∖ N1):  α̂ = 1/4
Coordinate 3 appears in both (∈ N1 ∩ N2):  α̂ = 1/4
We can estimate each of α = |N1∖N2|/|N1∪N2|, |N2∖N1|/|N1∪N2|, |N1∩N2|/|N1∪N2| unbiasedly, with σ² = α(1 − α)/k
k-partition sketches: Jaccard estimation
k = 4
s(N1) = (1.00, 1.00, 0.14, 0.21),  k′ = 2
s(N2) = (0.18, 1.00, 0.14, 0.35),  k′ = 3
s(N1 ∪ N2) = (0.18, 1.00, 0.14, 0.21),  k′ = 3
Coordinate 4 comes only from s(N1) (∈ N1 ∖ N2):  α̂ = 1/3
Coordinate 1 comes only from s(N2) (∈ N2 ∖ N1):  α̂ = 1/3
Coordinate 3 appears in both (∈ N1 ∩ N2):  α̂ = 1/3
We can estimate each of α = |N1∖N2|/|N1∪N2|, |N2∖N1|/|N1∪N2|, |N1∩N2|/|N1∪N2| unbiasedly, with σ² = α(1 − α)/k′ (conditioned on k′)
Bottom-k sketches: Jaccard estimation
k = 4
s(N1) = {0.09, 0.14, 0.18, 0.21}
s(N2) = {0.14, 0.17, 0.19, 0.35}
s(N1 ∪ N2) = {0.09, 0.14, 0.17, 0.18}   (the smallest k = 4 values in the union of the two sketches)
0.09 and 0.18 appear only in s(N1) (∈ N1 ∖ N2):  α̂ = 2/4
0.17 appears only in s(N2) (∈ N2 ∖ N1):  α̂ = 1/4
0.14 appears in both (∈ N1 ∩ N2):  α̂ = 1/4
We can estimate each of α = |N1∖N2|/|N1∪N2|, |N2∖N1|/|N1∪N2|, |N1∩N2|/|N1∪N2| unbiasedly, with σ² = (α(1 − α)/k)·(1 − (k − 1)/(n − 1))
Bottom-k sketches: better estimate
k = 4
s(N1) = {0.09, 0.14, 0.18, 0.21}
s(N2) = {0.14, 0.17, 0.19, 0.35}
s(N1) ∪ s(N2), restricted to values ≤ min{max s(N1), max s(N2)} = 0.21:
{0.09, 0.14, 0.17, 0.18, 0.19, 0.21},  k′ = 6 > 4
We can look beyond the union sketch: we have complete membership information on all elements with h(x) ≤ min{max s(N1), max s(N2)}. We have 2k > k′ ≥ k such elements!
Bottom-k sketches: better estimate
k = 4
s(N1) = {0.09, 0.14, 0.18, 0.21}
s(N2) = {0.14, 0.17, 0.19, 0.35}
Values ≤ 0.21 in s(N1) ∪ s(N2): {0.09, 0.14, 0.17, 0.18, 0.19, 0.21},  k′ = 6 > 4
0.09, 0.18, 0.21 appear only in s(N1) (∈ N1 ∖ N2):  α̂ = 3/6 = 1/2
0.17, 0.19 appear only in s(N2) (∈ N2 ∖ N1):  α̂ = 2/6 = 1/3
0.14 appears in both (∈ N1 ∩ N2):  α̂ = 1/6
We can estimate each of α = |N1∖N2|/|N1∪N2|, |N2∖N1|/|N1∪N2|, |N1∩N2|/|N1∪N2| unbiasedly, with σ² = (α(1 − α)/k′)·(1 − (k′ − 1)/(n − 1)) (conditioned on k′)
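
A sketch of this improved estimator (ours, illustrative), run on the example above:

def jaccard_beyond_union(s1, s2):
    # use every hash value up to min{max s1, max s2}: for these we have full membership info
    threshold = min(max(s1), max(s2))
    s1set, s2set = set(s1), set(s2)
    usable = [y for y in sorted(s1set | s2set) if y <= threshold]   # k <= k' < 2k values
    in_both = sum(1 for y in usable if y in s1set and y in s2set)
    return in_both / len(usable)          # estimate of |N1 ∩ N2| / |N1 ∪ N2|

s1 = [0.09, 0.14, 0.18, 0.21]
s2 = [0.14, 0.17, 0.19, 0.35]
print(jaccard_beyond_union(s1, s2))       # 1/6, using k' = 6 values instead of k = 4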
Cosine Similarity from Min-Hash sketches:
Crude estimator
C(N1, N2) = |N1 ∩ N2| / √(|N1| |N2|)
J(N1, N2) = |N1 ∩ N2| / |N1 ∪ N2|
⟹ C(N1, N2) = J(N1, N2) · |N1 ∪ N2| / √(|N1| |N2|)
 We have estimates with good relative error (and concentration) for |N1 ∪ N2|, 1/|N1|, 1/|N2|
 Plug these in
Next: Back to distinct counting
 Inverse-probability distinct count estimators
 Separately estimate the "presence" of each element
 "Historic" inverse-probability distinct count estimators
 General approach for deriving estimators: for all distributions, all Min-Hash sketch types
 About 1/2 the variance of purely sketch-based estimators
Inverse probability estimators
[Horvitz Thompson 1952]
Model: There is a hidden value x. It is observed/sampled with probability p > 0. We want to estimate f(x) ≥ 0. If x is sampled, we know both p and x, and can compute f(x).
Inverse Probability Estimator:
If x is sampled, f̂ = f(x)/p. Else, f̂ = 0.
 f̂ is unbiased:  E[f̂] = (1 − p)·0 + p·(f(x)/p) = f(x)
 Var[f̂] = E[f̂²] − f(x)² = p·(f(x)/p)² − f(x)² = f(x)²·(1/p − 1)
Comment: this variance is the minimum possible for an unbiased nonnegative estimator if the domain includes x with f(x) = 0.
Inverse-Probability estimate for a sum
We want to estimate the sum n = Σ_x f(x).
We have a sample S of elements. f(x) > 0 ⟹ p(x) > 0, and we know f(x) and p(x) when x ∈ S.
 We use f̂_x = f(x)/p(x) when x ∈ S, and f̂_x = 0 otherwise.
 Sum estimator: n̂ = Σ_x f̂_x = Σ_{x∈S} f̂_x
 Unbiased f̂_x implies unbiased n̂. This is important: bias would otherwise add up.
 For distinct counting, f(x) = I[x ∈ N] (indicator function).
Inverse-Probability estimate for a sum
 p(x) can be conditioned on a part of some partition of the outcomes. But elements with f(x) > 0 must have p(x) > 0 in all parts (otherwise we get bias).
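
A small simulation of the Horvitz-Thompson sum estimator for distinct counting (ours; the inclusion probabilities below are arbitrary but known, as the model requires):

import random

def ht_sum(sample):
    # sample: (f(x), p(x)) pairs for the sampled elements only
    return sum(fx / px for fx, px in sample)

random.seed(0)
n = 1000
f = {x: 1 for x in range(n)}                         # distinct counting: f(x) = 1, true sum = n
p = {x: 0.05 + 0.9 * random.random() for x in f}     # known per-element sampling probabilities > 0

estimates = []
for _ in range(5000):
    S = [(f[x], p[x]) for x in f if random.random() < p[x]]
    estimates.append(ht_sum(S))
print(sum(estimates) / len(estimates))               # close to n = 1000 (unbiased)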
Bottom-k sketches:
Inverse probability estimator
 We work with the uniform distribution h(x) ∼ U[0,1]
 For each distinct element, we consider the probability that it is one of the k − 1 lowest-hash elements.
 For a sketch y1 < ⋯ < yk, we say element x is "sampled" ⟺ y_i = h(x) for some i ≤ k − 1
Caveat: this probability is (k − 1)/n, the same for all elements, but we do not know n ⇒ we need to use conditioning.
Bottom-k sketches:
Inverse probability estimator
 We use an inverse probability estimate: if x is not sampled (not one of the k − 1 smallest-hash elements), the estimate is 0. Otherwise, it is 1/p(x).
But we do not know p! What can we do?
We compute p(x) conditioned on fixing h on N ∖ {x} but taking h(x) ∼ U[0,1].
 We need to be able to compute p(x) only for "sampled" elements.
Bottom-k sketches:
Inverse probability estimator
What is the probability p(x) that x is sampled if we fix h on N ∖ {x} but take h(x) ∼ U[0,1]?
x is sampled ⟺ h(x) < (k − 1)-th smallest of {h(z) | z ∈ N ∖ {x}}
For a sampled x, the (k − 1)-th smallest of {h(z) | z ∈ N ∖ {x}} equals y_k
⟹ p(x) = y_k
⟹ the inverse probability estimate is 1/p(x) = 1/y_k
Summing over the k − 1 "sampled" elements:  n̂ = (k − 1)/y_k
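
A quick simulation of the estimator n̂ = (k − 1)/y_k (ours, illustrative):

import random

def bottomk_count(hash_values, k):
    yk = sorted(hash_values)[k - 1]     # the k-th smallest hash value
    return (k - 1) / yk                 # inverse-probability estimate of n

random.seed(0)
n, k = 10000, 100
estimates = [bottomk_count([random.random() for _ in range(n)], k) for _ in range(200)]
print(sum(estimates) / len(estimates))  # close to n = 10000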
Explaining conditioning in Inverse
Probability Estimate for bottom-k
 Probability space on {h(z) | z ∈ N ∖ {x}}.
 Partitioned according to τ = (k − 1)-th smallest of {h(z) | z ∈ N ∖ {x}}.
 The conditional probability that x is sampled within a part is Pr[h(x) < τ] = τ.
 If x is "sampled" in the outcome, we know τ (it equals y_k) and the estimate is 1/τ. (If x is not sampled, then τ = y_{k−1} > 0; this is needed for unbiasedness, but the estimate for x is 0.)
Explaining conditioning in Inverse
Probability Estimate for bottom-k
N = {a, b, c, d, e},  k = 3
The probability that a has one of the k − 1 = 2 smallest values in h(a), h(b), …, h(e) is Pr[a ∈ S] = 2/5, but we cannot compute it since we do not know n (= 5).
The conditional probability Pr[a ∈ S | h(b), …, h(e)], however, can be computed:
Explaining conditioning in Inverse
Probability Estimate for bottom-k
k = 3.  What is Pr[a ∈ S | h(b), …, h(e)]?
Outcomes (h(b), …, h(e)), grouped by τ = the (k − 1)-th smallest of these values:
τ = 0.2:  (.11, .2, .28, .3)   (.03, .2, .4, .66)   (.1, .2, .7, .8)
τ = 0.3:  (.1, .3, .5, .6)   (.2, .3, .5, .71)   (.15, .3, .32, .4)
τ = 0.4:  (.1, .4, .5, .8)   (.12, .4, .45, .84)
Within each part, Pr[a ∈ S | h(b), …, h(e)] = Pr[h(a) < τ] = τ.
Bottom-k sketches:
Inverse probability estimators
n̂ = (k − 1)/y_k
 We obtain an unbiased estimator.
 No need to track element IDs (the sample view is only used for analysis).
 How good is this estimator? We can (but do not here) show that the CV σ/μ is ≤ 1/√(k − 2), at least as good as the k-mins estimator.
Better distinct count estimators ?
 Recap:
  Our estimators (k-mins, bottom-k) have CV σ/μ ≤ 1/√(k − 2)
  The CRLB (for k-mins) says CV σ/μ ≥ 1/√k
 Can we improve? Also, what about k-partition?
 The CRLB applies when we are limited to using only the information in the sketch.
 Idea: use information we discard along the way
“Historic” Inverse Probability Estimators
 We maintain an approximate count together with the sketch: (y1, …, yk), c
 Initially (y1, …, yk) ← (1, …, 1), c ← 0
 When the sketch y is updated, we compute the probability p that a new distinct element would cause an update to the current sketch.
 We increase the counter: c ← c + 1/p
 Easy to apply with all min-hash sketches
 The estimate is unbiased
 We can (but do not here) show CV σ/μ ≤ 1/√(2k − 2) < (1/√2)·(1/√(k − 2))
Maintaining a k-mins “historic” sketch
k-mins sketch: use k "independent" hash functions h1, h2, …, hk and track the respective minima y1, y2, …, yk.
Update probability: the probability p that h_i(x) < y_i for at least one i = 1, …, k:
p = 1 − ∏_{i=1}^{k} (1 − y_i)
Processing a new element x:
 p ← 1 − ∏_{i=1}^{k} (1 − y_i)
 For i = 1, …, k:  y_i ← min{y_i, h_i(x)}
 If y changed:  c ← c + 1/p
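
A minimal streaming sketch of this procedure (ours; the k hash functions are simulated by seeding Python's random module per element, which is illustrative rather than a production hash):

import math, random

class KMinsHistoric:
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k                      # current minimum per hash function
        self.c = 0.0                            # "historic" inverse-probability count

    def _hashes(self, x):
        rng = random.Random(hash((x, 42)))      # simulate k independent hashes of x
        return [rng.random() for _ in range(self.k)]

    def process(self, x):
        hx = self._hashes(x)
        p = 1.0 - math.prod(1.0 - yi for yi in self.y)   # update prob. for a new element
        changed = False
        for i in range(self.k):
            if hx[i] < self.y[i]:
                self.y[i] = hx[i]
                changed = True
        if changed:
            self.c += 1.0 / p

sketch = KMinsHistoric(k=64)
for x in range(20000):
    sketch.process(x % 5000)                    # stream with 5000 distinct elements
print(sketch.c)                                 # close to 5000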
Maintaining a k-partition “historic” sketch
Processing a new element x:
 i ← first log2(k) bits of h′(x)
 h ← remaining bits of h′(x)
 If h < y_i:
  p ← (1/k) Σ_{j=1}^{k} y_j
  c ← c + 1/p
  y_i ← h
Update probability: the probability p that h(x) < y_i for a part i selected uniformly at random
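
A matching sketch for the k-partition variant (ours; the part-hash and value-hash are again simulated per element):

import random

class KPartitionHistoric:
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k            # per-bucket minimum of the value-hash
        self.c = 0.0

    def process(self, x):
        rng = random.Random(hash((x, 13)))
        i = rng.randrange(self.k)     # part-hash: which of the k buckets
        h = rng.random()              # value-hash
        if h < self.y[i]:
            p = sum(self.y) / self.k  # update prob. for a uniformly random bucket
            self.c += 1.0 / p
            self.y[i] = h

sketch = KPartitionHistoric(k=64)
for x in range(20000):
    sketch.process(x % 5000)          # stream with 5000 distinct elements
print(sketch.c)                       # close to 5000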
Maintaining a bottom-k “historic” sketch
Bottom-k sketch: use a single hash function h and track the k smallest values y1 < y2 < ⋯ < yk.
Processing a new element x:
If h(x) < y_k:
 c ← c + 1/y_k
 (y1, y2, …, yk) ← sort{y1, y2, …, y_{k−1}, h(x)}
Probability of update: y_k
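
And the bottom-k variant (ours; the duplicate check "hx not in self.y" reflects that a repeated element never changes the sketch):

import random

class BottomKHistoric:
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k            # k smallest hash values seen so far, kept sorted
        self.c = 0.0

    def process(self, x):
        hx = random.Random(hash((x, 7))).random()     # simulated hash h(x)
        yk = self.y[-1]
        if hx < yk and hx not in self.y:              # update probability is y_k
            self.c += 1.0 / yk
            self.y = sorted(self.y[:-1] + [hx])

sketch = BottomKHistoric(k=64)
for x in range(30000):
    sketch.process(x % 8000)          # stream with 8000 distinct elements
print(sketch.c)                       # close to 8000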
Summary: Historic distinct estimators
Recap:
 Maintain sketch and count. The CV is ≤ 1/√(2k − 2) < (1/√2)·(1/√(k − 2))
 Easy to apply. Trivial to query. Unbiased.
More (we do not show here):
 The CV is almost tight for this type of estimator (estimating the presence of each distinct element entering the sketch) ⟹ can't do better than 1/√(2k)
 Mergeability: stated for streams. The "sketch" parts are mergeable, but merging the "counts" requires work (which uses the sketch parts)
  Approach: carefully estimate the overlap (say, using similarity estimators)
Next:
 Working with a small range
 So far Min-Hash sketches were
stated/analyzed for distributions (random
hash functions) with a continuous range
 We explain how to work with a discrete
range, how small the representation can
be, and how estimators are affected.
 Back-of-the-envelope calculations
Working with a small (discrete) range
When implementing min-hash sketches:
 We work with a discrete distribution over the hash range.
 We want to use as few bits as possible to represent the sketch.
Natural discrete distribution: h(x) = 2^{−i} with probability 2^{−i}
 Same as using u ∼ U[0,1] and retaining only the negated exponent ⌊log2(1/u)⌋.
 The expectation of the minimum is about 1/n ≈ 2^{−log2 n}
 The expected maximum exponent size is ≈ log2 log2 n
Elements sorted by hash, with their negated exponents:
 0.1xxxxx → 1
 0.01xx → 2
 0.001xx → 3
 0.0001xx → 4
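
A tiny illustration of the discrete hash (ours): the stored value is the position of the first 1-bit of u ∼ U[0,1], as in the figure above, so value i occurs with probability 2^{−i}.

import random

def negated_exponent(u):
    # position of the first 1-bit in the binary expansion of u (capped to avoid pathological u = 0)
    i = 1
    while u < 0.5 and i < 64:
        u *= 2.0
        i += 1
    return i

random.seed(0)
n = 10**6
max_exp = max(negated_exponent(random.random()) for _ in range(n))
print(max_exp)   # about log2(n) ≈ 20: the exponent that represents the minimum hash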
Working with a small (discrete) range
 We can also retain a few (b) bits beyond the exponent.
 The sketch size is then ≈ kb + k·log2 log2 n bits.
 It can be reduced further to log2 log2 n + O(kb) by noting that the exponent parts are very similar, so we can store the minimum exponent once and only "offsets" for the rest.
How does this rounding affect the estimators (properties and accuracy)?
We do "back-of-the-envelope" calculations.
Working with a small (discrete) range
"Parameter estimation" estimators; similarity estimators
We need to keep enough bits to ensure distinctness of the min-hash values within the same sketch (for similarity, across two sketches) with good probability. To apply the "continuous" estimators, we can take a random completion and then apply those estimators.
k-mins and k-partition: we can look at each "coordinate" separately. The expected number of elements with the same "minimum" exponent is fixed (the probability of exponent value j is 2^{−j}, so the expectation is n·2^{−j}). So we can work with a fixed b.
Working with a small (discrete) range
“parameter estimation” estimators; Similarity estimators
bottom-k: we need to separate the smallest k values. We expect about k/2 of them to have the maximum represented exponent, so we need log log n + O(log k) bits per register. We work with b = O(log k).
Working with a small (discrete) range
Inverse probability (also "historic") estimators:
Estimators apply directly to a discrete range: simply work with the probability that a hash from the discrete domain is strictly below the current "threshold".
 Unbiasedness still holds (on streams) even with likely hash collisions (with k-mins and k-partition)
 Variance increases by a factor of 1/(1 − 2^{−b}) ⇒ we get most of the value of the continuous domain with a small b
 For mergeability (supporting the similarity-like estimates needed to merge counts), or with bottom-k, we need to work with a larger b = O(log k) to ensure that hash collisions are unlikely (within one sketch or across two sketches).
Distinct counting/Min-Hash sketches bibliography 1
First use of k-mins Min-Hash sketches for distinct counting; first streaming algorithm for approximate distinct counting:
 P. Flajolet and N. Martin, "Probabilistic Counting Algorithms for Data Base Applications", JCSS (31), 1985.
Use of Min-Hash sketches for similarity, union size, mergeability, size estimation (k-mins; proposes bottom-k):
 E. Cohen, "Size-estimation framework with applications to transitive closure and reachability", JCSS (55), 1997.
Use of shingling with k-mins sketches for Jaccard similarity of text documents:
 A. Broder, "On the Resemblance and Containment of Documents", Sequences 1997.
 A. Broder, S. Glassman, M. Manasse, G. Zweig, "Syntactic Clustering of the Web", SRC technical note, 1997.
Better similarity estimators (beyond the union sketch) from bottom-k samples:
 E. Cohen and H. Kaplan, "Leveraging discarded samples for tighter estimation of multiple-set aggregates", SIGMETRICS 2009.
Asymptotic lower bound on distinct counter size (taking into account hash representation):
 N. Alon, Y. Matias, M. Szegedy, "The space complexity of approximating the frequency moments", STOC 1996.
Introducing k-partition sketches for distinct counting:
 Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, L. Trevisan, "Counting distinct elements in a data stream", RANDOM 2002.
Distinct counting/Min-Hash sketches bibliography 2
Practical distinct counters based on k-partition sketches:
 P. Flajolet, E. Fusy, O. Gandouet, F. Meunier, "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm".
 S. Heule, M. Nunkeser, A. Hall, "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm", EDBT 2013.
Theoretical algorithm with asymptotic bounds that match the AMS lower bound:
 D. M. Kane, J. Nelson, D. P. Woodruff, "An optimal algorithm for the distinct elements problem", PODS 2010.
Inverse probability "historic" estimators; application of Cramér-Rao to min-hash sketches:
 E. Cohen, "All-Distances Sketches, Revisited: Scalable Estimation of the Distance Distribution and Centralities in Massive Graphs", arXiv 2013.
The concepts of min-hash sketches and sketch coordination are related to concepts from the survey sampling literature: order samples (bottom-k) and coordination of samples using the PRN (Permanent Random Numbers) method.
More on bottom-k sketches, ML estimator for bottom-k:
 E. Cohen and H. Kaplan, "Summarizing data using bottom-k sketches", PODS 2007; "Tighter Estimation using bottom-k sketches", VLDB 2008.
Inverse probability estimator with priority (a type of bottom-k) sketches:
 N. Alon, N. Duffield, M. Thorup, C. Lund, "Estimating arbitrary subset sums with few probes", PODS 2005.