Using Sketches to Estimate Associations

Ping Li (Cornell)
Kenneth Church (Microsoft)

Sample contingency table        Original contingency table
        W2    ~W2                       W2    ~W2
  W1    a_s   b_s                 W1    a     b
  ~W1   c_s   d_s                 ~W1   c     d

Advertisement:
On Delivering Embarrassingly Distributed Cloud Services (Hotnets-2008)
Ken Church, Albert Greenberg, James Hamilton
{church, albert, jamesrh}@microsoft.com

[Figure: "Affordable" cost comparison, $1B vs. $2M]

Containers: Disruptive Technology

• Implications for Shipping
  – New Ships, Ports, Unions
• Implications for Hotnets
  – New Data Center Designs
  – Power/Networking Trade-offs
  – Cost Models: Expense vs. Capital
  – Apps: Embarrassingly Distributed
    • A restriction on Embarrassingly Parallel
  – Machine Models: Distributed Parallel Cluster → Parallel Cluster

Mega vs. Micro Data Centers

POPs       Cores/POP   Hardware/POP      Co-located With/Near
1          1,000,000   1000 containers   Mega Data Center
10         100,000     100 containers    Mega Data Center
100        10,000      10 containers     Fiber Hotel
1,000      1,000       1 container       Power Substation
10,000     100         1 rack            Central Office
100,000    10          1 mini-tower      P2P
1,000,000  1           embedded          P2P

Related Work

• http://en.wikipedia.org/wiki/Data_center
  – "A data center can occupy one room of a building…"
  – "Servers differ greatly in size from 1U servers to large … silos"
  – "Very large data centers may use shipping containers…" [2]
• 220 containers in one PoP → 220 containers in 220 PoPs

Embarrassingly Distributed Probes

• W1 & W2 are Shipping Containers
• Lots of bandwidth within a container
  – But less across containers
• Limited Bandwidth → Sampling

Sample contingency table        Original contingency table
        W2    ~W2                       W2    ~W2
  W1    a_s   b_s                 W1    a     b
  ~W1   c_s   d_s                 ~W1   c     d

≈ 1990: Strong vs. Powerful
• Strong: 427M (Google)
• Powerful: 353M (Google)
• Page Hits ≈ 1000x BNC freqs

Turney (and Know It All)

• PMI + The Web: Better together

Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001)

• "It never pays to think until you've run out of data" – Eric Brill
• Moore's Law Constant: Data Collection Rates → Improvement Rates
• More data is better data!
  – "Fire everybody and spend the money on data" (quoted out of context)
• No consistently best learner

Page Hits Estimates by MSN and Google (August 2005)

Query          Hits (MSN)       Hits (Google)
A              2,452,759,266    3,160,000,000   (more frequent)
The            2,304,929,841    3,360,000,000
Kalevala       159,937          214,000
Griseofulvin   105,326          149,000
Saccade        38,202           147,000         (less frequent)

• Larger corpora → Larger counts → More signal
• # of (English) documents: D ≈ 10^10
• Lots of hits even for very rare words

Caution: Estimates ≠ Actuals

Query                               Hits (MSN)    Hits (Google)
America                             150,731,182   393,000,000
America & China                     15,240,116    66,000,000
America & China & Britain           235,111       6,090,000
America & China & Britain & Japan   154,444       23,300,000

• These are just (quick-and-dirty) estimates (not actuals)
• Joint frequencies ought to decrease monotonically as we add more
  terms to the query; note that Google's four-way count above exceeds
  its three-way count.

Query Planning (Governator)

Rule of Thumb breaks down when there are strong interactions
(common for the cases of most interest).

            Query                                              Hits (Google)
One-way     Austria                                            88,200,000
            Governor                                           37,300,000
            Schwarzenegger                                     4,030,000
            Terminator                                         3,480,000
Two-way     Governor & Schwarzenegger                          1,220,000
            Governor & Austria                                 708,000
            Schwarzenegger & Terminator                        504,000
            Terminator & Austria                               171,000
            Governor & Terminator                              132,000
            Schwarzenegger & Austria                           120,000
Three-way   Governor & Schwarzenegger & Terminator             75,100
            Governor & Schwarzenegger & Austria                46,100
            Schwarzenegger & Terminator & Austria              16,000
            Governor & Terminator & Austria                    11,500
Four-way    Governor & Schwarzenegger & Terminator & Austria   6,930

Associations: PMI, MI, Cos, R, Cor…
Summaries of the Contingency Table

        W2    ~W2     Margins (aka doc freq)
  W1    a     b       f1 = a + b
  ~W1   c     d       f2 = a + c
                      D = a + b + c + d

• a: # of documents that contain both word W1 and word W2
• b: # of documents that contain word W1 but not word W2
• Need just one more constraint to compute the table (& summaries)
  – 4 parameters: a, b, c, d
  – 3 constraints: f1, f2, D

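To make the summaries concrete, here is a small sketch (not from the talk) that recovers the full table from a and the three constraints, then computes a few of the association scores named above. The PMI, cosine, and resemblance formulas are the standard ones; the example numbers are the Governor & Austria hits from the query-planning slide, with D ≈ 10^10 as assumed earlier.

```python
import math

def summaries(a, f1, f2, D):
    """Given the joint count a, margins f1 = a+b and f2 = a+c, and corpus
    size D, recover the full contingency table and common associations."""
    b = f1 - a                           # docs with W1 but not W2
    c = f2 - a                           # docs with W2 but not W1
    d = D - f1 - f2 + a                  # docs with neither word
    pmi = math.log((a * D) / (f1 * f2), 2)   # pointwise mutual info (bits)
    cos = a / math.sqrt(f1 * f2)             # cosine similarity
    r = a / (f1 + f2 - a)                    # resemblance (Jaccard) R
    return dict(a=a, b=b, c=c, d=d, PMI=pmi, cos=cos, R=r)

print(summaries(a=708_000, f1=37_300_000, f2=88_200_000, D=10**10))
```
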
Postings → Margins (and more)
(Postings aka Inverted File)

Postings(w): a sorted list of doc IDs for w

PIG → 13, 25, 33
  – Doc #13: "… This pig is so cute …"
  – Doc #25: "… saw a flying pig …"
  – Doc #33: "… was raining pigs and eggs …"

Assume doc IDs are random.

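A minimal inverted index in this spirit might look as follows; an illustrative sketch using the three toy documents from the slide, with deliberately naive whitespace tokenization.

```python
from collections import defaultdict

docs = {
    13: "This pig is so cute",
    25: "saw a flying pig",
    33: "was raining pigs and eggs",
}

# Build postings: word -> sorted list of doc IDs containing it.
postings = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        postings[token].add(doc_id)
postings = {w: sorted(ids) for w, ids in postings.items()}

print(postings["pig"])        # [13, 25]  ("pigs" tokenizes separately here)
f_pig = len(postings["pig"])  # the margin (doc freq) is just |Postings(w)|
```
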
Conventional Random Sampling (Over Documents)

Sample contingency table        Original contingency table
        W2    ~W2                       W2    ~W2
  W1    a_s   b_s                 W1    a     b
  ~W1   c_s   d_s                 ~W1   c     d

Sample size: $D_s = a_s + b_s + c_s + d_s$

Margin-Free baseline: $\hat{a}_{MF} = \frac{D}{D_s} a_s$

Random Sampling

• Over documents
  – Simple & well understood
  – But problematic for rare events
• Over postings
  – $a_s = a \left(\frac{k}{f}\right)^2$ (undesirable)
  – where f = |P| (P = postings, aka inverted file)
  – f aka doc freq or margin

Sketches >> Random Samples

        W2    ~W2
  W1    a_s   b_s
  ~W1   c_s   d_s

• Undesirable (random sampling over postings): $a_s = a \left(\frac{k}{f}\right)^2$
• Better (sketches, the best use of the same budget k): $a_s \approx a \, \frac{k}{f}$

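A small simulation can make the quadratic-vs-linear contrast visible; a sketch, with all parameter values below made up for illustration and both words given the same doc frequency f.

```python
import random

def trial(D=10_000, f=1_000, a=200, k=100, reps=1_000):
    """Compare E[a_s]: sample k of the f postings of each word independently
    (random sampling over postings) vs. keep the k smallest doc IDs (sketch)."""
    rnd_total = sk_total = 0
    n_docs = 2 * f - a                         # docs containing W1 and/or W2
    for _ in range(reps):
        ids = random.sample(range(D), n_docs)  # random doc IDs, no repeats
        both, only1, only2 = ids[:a], ids[a:f], ids[f:]
        P1 = sorted(both + only1)              # postings of W1, |P1| = f
        P2 = sorted(both + only2)              # postings of W2, |P2| = f
        rnd_total += len(set(random.sample(P1, k)) & set(random.sample(P2, k)))
        sk_total += len(set(P1[:k]) & set(P2[:k]))   # front of postings
    print("random over postings:", rnd_total / reps, "~ a(k/f)^2 =", a * (k/f)**2)
    print("sketch (front):      ", sk_total / reps, "~ a(k/f)   =", a * k / f)

trial()
```

The sketches win because both fronts are taken under the *same* random doc-ID assignment, so a shared doc with a small ID tends to land in both sketches at once.
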
Outline

• Review random sampling
  – and introduce a running example
• Sample: Sketches
  – A generalization of Broder's Original Method
  – Advantages: larger a_s than random sampling
  – Disadvantages: estimation becomes more challenging
• Estimation: Maximum Likelihood (MLE)
• Evaluation

Random Sampling over Documents

        W2    ~W2
  W1    a_s   b_s
  ~W1   c_s   d_s

• Doc IDs are random integers between 1 and D = 36
• In the slide's figure, small circles mark word W1 and small squares mark word W2
• Choose a sample size: D_s = 18. Sampling rate = D_s/D = 50%
• Construct the sample contingency table:
  a_s = |{4, 15}| = 2,  b_s = |{3, 7, 9, 10, 18}| = 5,
  c_s = |{2, 5, 8}| = 3,  d_s = D_s − a_s − b_s − c_s = 8
• Estimation: $\hat{a} \approx \frac{D}{D_s} a_s$
• But that doesn't take advantage of the margins

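The running example is easy to reproduce; the postings P1 and P2 below are the ones given on the "Proposed Sketches" slide, which are consistent with the sets above.

```python
# The slide's running example: D = 36 docs, sample doc IDs 1..Ds with Ds = 18.
D, Ds = 36, 18
P1 = {3, 4, 7, 9, 10, 15, 18, 19, 24, 25, 28, 33}   # docs containing W1
P2 = {2, 4, 5, 8, 15, 19, 21, 24, 27, 28, 31, 35}   # docs containing W2

sample = set(range(1, Ds + 1))                       # sampled doc IDs
a_s = len(P1 & P2 & sample)                          # 2  -> {4, 15}
b_s = len((P1 - P2) & sample)                        # 5  -> {3, 7, 9, 10, 18}
c_s = len((P2 - P1) & sample)                        # 3  -> {2, 5, 8}
d_s = Ds - a_s - b_s - c_s                           # 8

a_hat_mf = D / Ds * a_s                              # margin-free estimate: 4.0
print(a_s, b_s, c_s, d_s, a_hat_mf)                  # true a = |P1 & P2| = 5
```
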
Proposed Sketches
Sketch = Front of Postings

Postings:
  P1: 3 4 7 9 10 15 18 19 24 25 28 33
  P2: 2 4 5 8 15 19 21 24 27 28 31 35

• Keep only the front of each postings list; throw out the rest
  (shown in red on the slide):
  K1 = {3, 4, 7, 9, 10, 15, 18},  K2 = {2, 4, 5, 8, 15, 19, 21}
• Choose the sample size: D_s = min(max K1, max K2) = min(18, 21) = 18
• Based on the sketches:
  a_s = |K1 ∩ K2| (within 1..D_s) = |{4, 15}| = 2
  b_s = 7 − a_s = 5   (7 entries of K1 fall within 1..D_s)
  c_s = 5 − a_s = 3   (5 entries of K2 fall within 1..D_s)
  d_s = D_s − a_s − b_s − c_s = 8

        W2    ~W2
  W1    a_s   b_s
  ~W1   c_s   d_s

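The sketch construction translates directly into code; the choice of k = 7 for both lists below is inferred from the slide's numbers (18 = max of K1, 21 = max of K2) rather than stated explicitly.

```python
def sketch_table(P1, P2, k1, k2):
    """Build the sample contingency table from front-of-postings sketches."""
    K1, K2 = P1[:k1], P2[:k2]            # sketches: fronts of sorted postings
    Ds = min(K1[-1], K2[-1])             # effective sample of doc IDs 1..Ds
    n1 = sum(1 for x in K1 if x <= Ds)   # K1 entries inside the sample
    n2 = sum(1 for x in K2 if x <= Ds)   # K2 entries inside the sample
    a_s = len(set(K1) & set(K2) & set(range(1, Ds + 1)))
    return a_s, n1 - a_s, n2 - a_s, Ds - n1 - n2 + a_s, Ds

P1 = [3, 4, 7, 9, 10, 15, 18, 19, 24, 25, 28, 33]
P2 = [2, 4, 5, 8, 15, 19, 21, 24, 27, 28, 31, 35]
print(sketch_table(P1, P2, 7, 7))        # (2, 5, 3, 8, 18), as on the slide
```
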
Estimation: Maximum Likelihood (MLE)

When we know the margins, we ought to use them
(margin-free baseline: $\hat{a}_{MF} = \frac{D}{D_s} a_s$).

• Consider all possible contingency tables: a, b, c & d
• Select the table that maximizes the probability of the
  observations: a_s, b_s, c_s & d_s

$\hat{a}_{MLE} = \arg\max_a P(a_s, b_s, c_s, d_s \mid D_s; a)$

$P(a_s, b_s, c_s, d_s \mid D_s; a)
  = \binom{a}{a_s}\binom{b}{b_s}\binom{c}{c_s}\binom{d}{d_s} \Big/ \binom{D}{D_s}
  = \binom{a}{a_s}\binom{f_1 - a}{b_s}\binom{f_2 - a}{c_s}\binom{D - f_1 - f_2 + a}{d_s} \Big/ \binom{D}{D_s}$

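Before the exact solver on the next slides, a brute-force version of this arg-max is a useful sanity check; a sketch (not the talk's recommended method) that simply scans all feasible values of a.

```python
from math import comb, inf

def mle_brute_force(a_s, b_s, c_s, d_s, f1, f2, D):
    """Exact (hypergeometric) MLE by scanning all feasible values of a."""
    best_a, best_p = None, -inf
    for a in range(a_s, min(f1, f2) + 1):
        b, c, d = f1 - a, f2 - a, D - f1 - f2 + a
        if b < b_s or c < c_s or d < d_s:
            continue                     # table inconsistent with observations
        p = comb(a, a_s) * comb(b, b_s) * comb(c, c_s) * comb(d, d_s)
        if p > best_p:                   # comb(D, Ds) is a common factor
            best_a, best_p = a, p
    return best_a

# Running example: f1 = f2 = 12, D = 36, sketch sample (2, 5, 3, 8).
print(mle_brute_force(2, 5, 3, 8, f1=12, f2=12, D=36))   # -> 4 (true a is 5)
```
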
Exact MLE

First derivative of the log likelihood:

$\frac{\partial \log P(a_s, b_s, c_s, d_s \mid D_s; a)}{\partial a}
  = \sum_{i=0}^{a_s - 1} \frac{1}{a - i}
  - \sum_{i=0}^{b_s - 1} \frac{1}{f_1 - a - i}
  - \sum_{i=0}^{c_s - 1} \frac{1}{f_2 - a - i}
  + \sum_{i=0}^{d_s - 1} \frac{1}{D - f_1 - f_2 + a - i}$

Setting $\frac{\partial \log P}{\partial a} = 0$ gives the MLE solution.

Problem: too complicated; numerical problems.

Exact MLE

Second derivative:

$\frac{\partial^2 \log P(a_s, b_s, c_s, d_s \mid D_s; a)}{\partial a^2} < 0$

The log likelihood function is concave → unique maximum.

PMF updating formula:

$P(a_s, b_s, c_s, d_s \mid D_s; a) = P(a_s, b_s, c_s, d_s \mid D_s; a - 1) \, g(a)$

It suffices to solve g(a) = 1.

Exact MLE

MLE solution:

$g(a) = \frac{a}{a - a_s}
  \cdot \frac{f_1 - a + 1 - b_s}{f_1 - a + 1}
  \cdot \frac{f_2 - a + 1 - c_s}{f_2 - a + 1}
  \cdot \frac{D - f_1 - f_2 + a}{D - f_1 - f_2 + a - d_s} = 1$

g(a) = 1 is a cubic equation in a.

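Because the log likelihood is concave, one never needs to expand the cubic: it suffices to step a upward while the PMF ratio g(a) stays above 1. A sketch along those lines:

```python
def mle_exact(a_s, b_s, c_s, d_s, f1, f2, D):
    """Exact MLE: largest a for which g(a) = P(...;a)/P(...;a-1) > 1.
    Concavity of the log likelihood makes this the maximizer."""
    def g(a):
        return (a / (a - a_s)
                * (f1 - a + 1 - b_s) / (f1 - a + 1)
                * (f2 - a + 1 - c_s) / (f2 - a + 1)
                * (D - f1 - f2 + a) / (D - f1 - f2 + a - d_s))
    a = a_s                                 # smallest feasible a
    a_max = min(f1 - b_s, f2 - c_s)         # largest feasible a
    while a < a_max and g(a + 1) > 1:
        a += 1
    return a

print(mle_exact(2, 5, 3, 8, f1=12, f2=12, D=36))   # -> 4 on the running example
```
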
An Approximate MLE

Suppose we were sampling from the two inverted files directly and
independently:

        y     ~y
  x     a_s   b_s
  ~x    c_s   d_s

$n_x = a_s + b_s$,  $n_y = a_s + c_s$,  $D_s = a_s + b_s + c_s + d_s$

Approximate MLE: maximize

$P(a_s, b_s, c_s; a) = P(a_s, b_s; a) \times P(a_s, c_s; a)$

An Approximate MLE
Convenient Closed-Form Solution

$P(a_s, b_s, c_s \mid a) \propto a^{2a_s} (f_x - a)^{b_s} (f_y - a)^{c_s}$

Take the log of both sides and set the derivative to 0:

$\frac{2a_s}{a} - \frac{b_s}{f_x - a} - \frac{c_s}{f_y - a} = 0$

• Convenient closed form
• Surprisingly accurate
• Recommended

$\hat{a}_{MLE,i} = \frac{ f_x(2a_s + c_s) + f_y(2a_s + b_s)
  - \sqrt{ \left[ f_x(2a_s + c_s) + f_y(2a_s + b_s) \right]^2
           - 8 f_x f_y a_s (2a_s + b_s + c_s) } }{ 2\,(2a_s + b_s + c_s) }$

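The closed form is a one-liner; checking it on the running example (f_x = f_y = 12, sample 2/5/3) gives 4.0, matching the exact MLE above.

```python
from math import sqrt

def mle_approx(a_s, b_s, c_s, fx, fy):
    """Approximate MLE: closed-form root of the quadratic from the slide."""
    s = 2 * a_s + b_s + c_s
    t = fx * (2 * a_s + c_s) + fy * (2 * a_s + b_s)
    disc = t * t - 8 * fx * fy * a_s * s
    return (t - sqrt(disc)) / (2 * s)      # the smaller root is the feasible one

print(mle_approx(2, 5, 3, fx=12, fy=12))   # -> 4.0 on the running example
```
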
Evaluation

When we know the margins, we ought to use them.

[Figure: estimation error for the Independence baseline, the Margin-Free
baseline, and the Proposed method (best)]

Theoretical Evaluation

• Not surprisingly, there is a trade-off between
  – Computational work: space, time
  – Statistical accuracy: variance, error
• Formulas state the trade-off precisely in terms of the sampling rate: D_s/D
• Theoretical evaluation:
  – Proposed MLE is better than the Margin-Free baseline
  – Confirms the empirical evaluation

How many samples are enough?

[Figure: sampling rate needed to achieve cv = SE/a < 0.5]

• Larger D → smaller sampling rate
• A cluster of 10k machines → a single machine
• At web scale (D ≈ 10^10), a sampling rate of 10^-4 may suffice
  for "ordinary" words.

Broder's Sketch: Original & Minwise
Estimate Resemblance (R)

• Notation
  – Words: w1, w2
  – Postings: P1, P2 (sets of doc IDs)
  – Resemblance: $R = \frac{|P_1 \cap P_2|}{|P_1 \cup P_2|} = \frac{a}{f_1 + f_2 - a}$
  – Random permutation: π
• Minwise sketch:
  – Permute doc IDs k times: π_1, …, π_k
  – For each π_i, let min_i(P) be the smallest doc ID in π_i(P)
  – $R = \Pr[\min_i(P_1) = \min_i(P_2)]$, so
    $\hat{R} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}\{\min_i(P_1) = \min_i(P_2)\}$
• Original sketch:
  – Sketches: K1, K2 (sets of doc IDs; front of postings)
  – Permute doc IDs once: π
  – Let first_k(P) be the first k doc IDs in π(P):
    $K_1 = \mathrm{first}_{k_1}(P_1)$, $K_2 = \mathrm{first}_{k_2}(P_2)$
  – $\hat{a}_s = |\mathrm{first}_k(K_1 \cup K_2) \cap K_1 \cap K_2|$
  – Throws out half (the part of each sketch beyond first_k(K1 ∪ K2))

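A minwise-style estimate takes only a few lines; simulating each random permutation with a salted hash, as below, is an implementation convenience for the sketch, not part of Broder's formulation.

```python
import random

def minwise_resemblance(P1, P2, k=512, seed=0):
    """Estimate R = |P1 & P2| / |P1 | P2| with k simulated permutations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(k):
        salt = rng.random()
        # hash((doc, salt)) plays the role of a random permutation pi_i
        key = lambda doc: hash((doc, salt))
        if min(P1, key=key) == min(P2, key=key):
            hits += 1
    return hits / k

P1 = {3, 4, 7, 9, 10, 15, 18, 19, 24, 25, 28, 33}
P2 = {2, 4, 5, 8, 15, 19, 21, 24, 27, 28, 31, 35}
print(minwise_resemblance(P1, P2))   # true R = 5 / 19 ≈ 0.263
```
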
Multi-way Associations: Evaluation

When we know the margins, we ought to use them.

[Figure: MSE relative improvement over the margin-free baseline]

• Gains are larger for 2-way than multi-way
• Degrees of freedom = 2^m − (m + 1) increase exponentially, suggesting
  margin constraints become less important as m increases.

Conclusions (1 of 2)

When we know the margins, we ought to use them.

• Estimating contingency tables: a fundamental problem
• Practical app:
  – Estimating page hits for two or more words (Governator)
  – Know It All: estimating mutual information from page hits
• Baselines:
  – Independence: ignores interactions (awful)
  – Margin-Free: ignores postings (wasteful): $\hat{a}_{MF} = \frac{D}{D_s} a_s$
  – (≈2x) Broder's Sketch (WWW 97, STOC 98, STOC 2002)
    • Throws out half the sample
  – (≈10x) Random Projections (ACL 2005, STOC 2002)
• Proposed method:
  – Sampling: like Broder's Sketch, but throws out less
    • Larger a_s than random sampling
  – Estimation: MLE (Maximum Likelihood)
    • MF: estimation is easy without margin constraints
    • MLE: find the most likely contingency table,
      given the observations: a_s, b_s, c_s, d_s

Conclusions (2 of 2)

Rising Tide of Data Lifts All Boats:
if you have a lot of data, then you don't need a lot of methodology.

• Recommended approximation:

$\hat{a}_{MLE,a} = \frac{ f_1(2a_s + c_s) + f_2(2a_s + b_s)
  - \sqrt{ \left[ f_1(2a_s + c_s) + f_2(2a_s + b_s) \right]^2
           - 8 f_1 f_2 a_s (2a_s + b_s + c_s) } }{ 2\,(2a_s + b_s + c_s) }$

• Trade-off between
  – Computational work (space and time) and
  – Statistical accuracy (variance and errors)
• Derived formulas for variance
  – Showing how the trade-off depends on the sampling rate
• At Web scale, a sampling rate (D_s/D) ≈ 10^-4 may suffice
  – A cluster of 10k machines → a single machine

Backup
Comparison with Broder's Algorithm

Broder's method has larger variance (≈2x) because it uses only half
the sketch: $\mathrm{Var}(\hat{R}_{MLE}) \ll \mathrm{Var}(\hat{R}_B)$.

$\frac{\mathrm{Var}(\hat{R}_{MLE})}{\mathrm{Var}(\hat{R}_B)}
  \approx \frac{k \, \max\!\left(\frac{f_1}{k_1}, \frac{f_2}{k_2}\right)}{f_1 + f_2}
  = \begin{cases}
      \dfrac{\max(f_1, f_2)}{f_1 + f_2} & \text{if } k_1 = k_2 = k
        \text{ (equal samples)} \\[6pt]
      \dfrac{1}{2} & \text{if } k_1 = \dfrac{f_1}{f_1 + f_2}\,2k,\;
        k_2 = \dfrac{f_2}{f_1 + f_2}\,2k \text{ (proportional samples)}
    \end{cases}$

when $a \ll \min(f_1, f_2) \le \max(f_1, f_2) \ll D$.

Comparison with Broder's Algorithm

[Figure: ratio of variances (equal samples): Var(R̂_MLE) << Var(R̂_B)]

Comparison with Broder's Algorithm

[Figure: ratio of variances (proportional samples): Var(R̂_MLE) << Var(R̂_B)]

Comparison with Broder's Algorithm
Estimation of Resemblance

[Figure: Broder's method throws out half the samples → 50% improvement
for the proposed estimator]

Comparison with Random Projections
Estimation of Angle

[Figure: huge improvement]

Comparison with Random Projections

[Figure: 10x improvement]

Download