Using Sketches to Estimate Associations
Ping Li (Cornell), Kenneth Church (Microsoft)
DIMACS, 3/27/2009

Sample contingency table        Original contingency table
        W2    ~W2                       W2    ~W2
 W1     a_s   b_s                W1     a     b
~W1     c_s   d_s               ~W1     c     d

On Delivering Embarrassingly Distributed Cloud Services
Hotnets-2008
Ken Church, Albert Greenberg, James Hamilton
{church, albert, jamesrh}@microsoft.com
[Figure: $1B vs. $2M]

Containers: Disruptive Technology
• Implications for Shipping: new ships, ports, unions
• Implications for Hotnets:
  – New data center designs
  – Power/networking trade-offs
  – Cost models: expense vs. capital
  – Apps: embarrassingly distributed (a restriction on embarrassingly parallel)
  – Machine models: distributed parallel cluster vs. parallel cluster

Mega vs. Micro Data Centers
POPs       Cores/POP   Hardware/POP     Co-located With/Near
1          1,000,000   1000 containers  Mega Data Center
10         100,000     100 containers
100        10,000      10 containers    Fiber Hotel/Power Substation
1,000      1,000       1 container
10,000     100         1 rack           Central Office
100,000    10          1 mini-tower     P2P
1,000,000  1           embedded

Related Work
• http://en.wikipedia.org/wiki/Data_center
  – "A data center can occupy one room of a building…"
  – "Servers differ greatly in size from 1U servers to large … silos"
  – "Very large data centers may use shipping containers…" [2]
• 220 containers in one PoP vs. 220 containers in 220 PoPs

Embarrassingly Distributed Probes
• W1 & W2 are shipping containers
• Lots of bandwidth within a container, but less across containers
• Limited bandwidth → sampling (sample vs. original contingency tables, as above)

Page Hits ≈ 1000x BNC Freqs (≈ 1990)
• Strong: 427M (Google)
• Powerful: 353M (Google)

Turney (and KnowItAll)
• PMI + The Web: better together

"It never pays to think until you've run out of data" – Eric Brill
• Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001)
• Moore's Law constant: data collection rates vs. improvement rates
• "More data is better data! Fire everybody and spend the money on data" (quoted out of context; no consistently best learner)

Page Hits Estimates by MSN and Google (August 2005)
Larger corpora → larger counts → more signal.
Query         Hits (MSN)     Hits (Google)
A             2,452,759,266  3,160,000,000   (more frequent)
The           2,304,929,841  3,360,000,000
Kalevala      159,937        214,000
Griseofulvin  105,326        149,000
Saccade       38,202         147,000         (less frequent)
# of (English) documents D ≈ 10^10. Lots of hits even for very rare words.

Caution: Estimates ≠ Actuals
Query                              Hits (MSN)   Hits (Google)
America                            150,731,182  393,000,000
America & China                    15,240,116   66,000,000
America & China & Britain          235,111      6,090,000
America & China & Britain & Japan  154,444      23,300,000
These are just (quick-and-dirty) estimates (not actuals). Joint frequencies ought to decrease monotonically as we add more terms to the query, yet the Google counts for the 3-way and 4-way queries do not. (A code sketch of the rule of thumb follows.)
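To make the rule of thumb concrete before the next slide stresses it: under independence across D documents, the expected joint count is hits(x) · hits(y) / D. A minimal Python sketch (mine, not the talk's); the counts are the Google hits from the Governator slide below, and D ≈ 10^10 is the deck's ballpark for the number of English documents:

# Independence "rule of thumb": with D documents total, if words occurred
# independently, the expected joint count would be D * prod(hits_i / D).
D = 10**10

hits = {
    "Governor": 37_300_000,        # Google counts from the Governator slide
    "Schwarzenegger": 4_030_000,
}

def rule_of_thumb(words, hits=hits, D=D):
    """Expected joint hits under independence: D * prod(hits[w] / D)."""
    expected = D
    for w in words:
        expected *= hits[w] / D
    return expected

est = rule_of_thumb(["Governor", "Schwarzenegger"])
print(f"rule of thumb: {est:,.0f}")    # ~15,000
print("observed (Google): 1,220,000")  # ~80x larger: a strong interaction

The observed joint count is roughly 80x the independence estimate, exactly the kind of strong interaction the next slide is about.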
Query Planning
The rule of thumb breaks down when there are strong interactions (common for the cases of most interest, e.g., the Governator):
Query                                              Hits (Google)
One-way:
  Austria                                          88,200,000
  Governor                                         37,300,000
  Schwarzenegger                                   4,030,000
  Terminator                                       3,480,000
Two-way:
  Governor & Schwarzenegger                        1,220,000
  Governor & Austria                               708,000
  Schwarzenegger & Terminator                      504,000
  Terminator & Austria                             171,000
  Governor & Terminator                            132,000
  Schwarzenegger & Austria                         120,000
Three-way:
  Governor & Schwarzenegger & Terminator           75,100
  Governor & Schwarzenegger & Austria              46,100
  Schwarzenegger & Terminator & Austria            16,000
  Governor & Terminator & Austria                  11,500
Four-way:
  Governor & Schwarzenegger & Terminator & Austria 6,930

Associations: PMI, MI, Cos, R, Cor…
Summaries of the contingency table:
        W2    ~W2
 W1     a     b      f_1 = a + b
~W1     c     d
f_2 = a + c          D = a + b + c + d
• a: # of documents that contain both word W1 and word W2
• b: # of documents that contain word W1 but not word W2
• Margins (aka doc freq): f_1, f_2
• To compute the table (& summaries): 4 parameters (a, b, c, d), 3 constraints (f_1, f_2, D)
• Need just one more constraint

Margins (and More) from Postings (Postings aka Inverted File)
Postings(w): a sorted list of doc IDs for w.
PIG → 13, 25, 33
  Doc #13: "This pig is so cute …"
  Doc #25: "… saw a flying pig …"
  Doc #33: "… was raining pigs and eggs …"
Assume doc IDs are random.

Conventional Random Sampling (Over Documents)
Sample contingency table (a_s, b_s, c_s, d_s) vs. original contingency table (a, b, c, d).
Sample size: $D_s = a_s + b_s + c_s + d_s$.
Margin-free baseline: $\hat{a}_{MF} = \frac{D}{D_s} a_s$.

Random Sampling
• Over documents: simple & well understood, but problematic for rare events.
• Over postings: $a_s = a\,(k/f)^2$, where f = |P| (P = postings, aka inverted file; |P| aka doc freq or margin). Undesirable.

Sketches >> Random Samples
• Best: $a_s \approx a\,(k/f)$ (sketches; see the simulation after this slide)
• Undesirable: $a_s = a\,(k/f)^2$ (independent random samples of the postings)
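A small simulation of this claim (mine, not the talk's; the corpus sizes are made up): sampling each posting list independently at rate k/f keeps about a(k/f)^2 of the intersection, while sketches — fronts of the postings under one shared random assignment of doc IDs — keep about a(k/f):

import random

D, f, a, k = 100_000, 2_000, 500, 200   # corpus, |P1| = |P2|, |P1 & P2|, sample size
rng = random.Random(0)

def make_postings():
    docs = rng.sample(range(D), 2 * f - a)      # docs touched by either word
    both, only1, only2 = docs[:a], docs[a:f], docs[f:]
    return set(both + only1), set(both + only2)

trials, rand_hits, sketch_hits = 200, 0, 0
for _ in range(trials):
    P1, P2 = make_postings()
    # (1) independent random samples of size k from each posting list
    S1 = set(rng.sample(sorted(P1), k))
    S2 = set(rng.sample(sorted(P2), k))
    rand_hits += len(S1 & S2)
    # (2) sketches: the first k doc IDs of each posting (IDs are already
    #     random, which plays the role of the shared permutation)
    K1 = set(sorted(P1)[:k])
    K2 = set(sorted(P2)[:k])
    Ds = min(max(K1), max(K2))                  # common prefix of the ID space
    sketch_hits += len({d for d in K1 & K2 if d <= Ds})

print(f"random sampling: {rand_hits/trials:.1f}  (≈ a*(k/f)^2 = {a*(k/f)**2:.1f})")
print(f"sketches:        {sketch_hits/trials:.1f} (≈ a*(k/f)   = {a*(k/f):.1f})")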
Outline
• Review random sampling, and introduce a running example
• Sample: sketches
  – A generalization of Broder's original method
  – Advantage: larger a_s than random sampling
  – Disadvantage: estimation is more challenging
• Estimation: maximum likelihood (MLE)
• Evaluation

Random Sampling over Documents
• Doc IDs are random integers between 1 and D = 36.
• [Figure: small circles mark docs containing word W1; small squares mark docs containing word W2.]
• Choose a sample size: D_s = 18. Sampling rate = D_s/D = 50%.
• Construct the sample contingency table:
  a_s = |{4, 15}| = 2, b_s = |{3, 7, 9, 10, 18}| = 5,
  c_s = |{2, 5, 8}| = 3, d_s = |{1, 6, 11, 12, 13, 14, 16, 17}| = 8
• Estimation: $a \approx \frac{D}{D_s} a_s$. But that doesn't take advantage of the margins.

Proposed Sketches
Sketch = front of postings:
P1: 3 4 7 9 10 15 18 | 19 24 25 28 33
P2: 2 4 5 8 15 19 21 | 24 27 28 31 35
• Sketches: K_1 = first 7 doc IDs of P1 = {3, 4, 7, 9, 10, 15, 18}; K_2 = first 7 of P2 = {2, 4, 5, 8, 15, 19, 21}
• Choose the sample size: D_s = min(18, 21) = 18; throw out doc IDs beyond D_s (here 19 and 21, shown in red on the slide)
• a_s = |{4, 15}| = 2
• b_s = 7 − a_s = 5
• c_s = 5 − a_s = 3
• d_s = D_s − a_s − b_s − c_s = 8

Estimation: Maximum Likelihood (MLE)
Margin-free baseline: $\hat{a}_{MF} = \frac{D}{D_s} a_s$. When we know the margins, we ought to use them:
$\hat{a}_{MLE} = \arg\max_a P(a_s, b_s, c_s, d_s \mid D_s; a)$
Consider all possible contingency tables (a, b, c, d) and select the one that maximizes the probability of the observations (a_s, b_s, c_s, d_s):
$P(a_s, b_s, c_s, d_s \mid D_s; a) = \binom{a}{a_s}\binom{b}{b_s}\binom{c}{c_s}\binom{d}{d_s} \Big/ \binom{D}{D_s} = \binom{a}{a_s}\binom{f_1-a}{b_s}\binom{f_2-a}{c_s}\binom{D-f_1-f_2+a}{d_s} \Big/ \binom{D}{D_s}$

Exact MLE
First derivative of the log likelihood:
$\frac{\partial \log P(a_s,b_s,c_s,d_s \mid D_s; a)}{\partial a} = \sum_{i=0}^{a_s-1}\frac{1}{a-i} - \sum_{i=0}^{b_s-1}\frac{1}{f_1-a-i} - \sum_{i=0}^{c_s-1}\frac{1}{f_2-a-i} + \sum_{i=0}^{d_s-1}\frac{1}{D-f_1-f_2+a-i}$
Setting $\frac{\partial \log P}{\partial a} = 0$ gives the MLE solution. Problem: too complicated; numerical problems.

Second derivative: $\frac{\partial^2 \log P}{\partial a^2} \le 0$, so the log likelihood is concave and has a unique maximum.
PMF updating formula: $P(a_s,b_s,c_s,d_s \mid D_s; a) = P(a_s,b_s,c_s,d_s \mid D_s; a-1)\, g(a)$, so it suffices to solve g(a) = 1:
$g(a) = \frac{a\,(f_1-a+1-b_s)\,(f_2-a+1-c_s)\,(D-f_1-f_2+a)}{(a-a_s)\,(f_1-a+1)\,(f_2-a+1)\,(D-f_1-f_2+a-d_s)} = 1,$
which reduces to a cubic equation in a.

An Approximate MLE
Suppose we were sampling from the two inverted files directly and independently:
  n_x = a_s + b_s with likelihood P(a_s, b_s; a)
  n_y = a_s + c_s with likelihood P(a_s, c_s; a)
Approximate MLE: maximize P(a_s, b_s, c_s; a) = P(a_s, b_s; a) × P(a_s, c_s; a).

An Approximate MLE: Convenient Closed-Form Solution
$P(a_s, b_s, c_s \mid a) \propto a^{2a_s} (f_x - a)^{b_s} (f_y - a)^{c_s}$
Take the log of both sides and set the derivative to 0:
$\frac{2a_s}{a} - \frac{b_s}{f_x - a} - \frac{c_s}{f_y - a} = 0$
$\hat{a}_{MLE,a} = \frac{f_x(2a_s+c_s) + f_y(2a_s+b_s) - \sqrt{\big(f_x(2a_s+c_s) + f_y(2a_s+b_s)\big)^2 - 8 f_x f_y a_s (2a_s+b_s+c_s)}}{2\,(2a_s+b_s+c_s)}$
• Convenient closed form
• Surprisingly accurate
• Recommended
(A worked implementation on the running example follows the evaluation slides below.)

Evaluation
[Figure: proposed MLE vs. the margin-free and independence baselines; the proposed method is best.] When we know the margins, we ought to use them.

Theoretical Evaluation
• Not surprisingly, there is a trade-off between
  – computational work: space, time
  – statistical accuracy: variance, error
• Formulas state the trade-off precisely in terms of the sampling rate D_s/D.
• Theoretical evaluation: the proposed MLE is better than the margin-free baseline, confirming the empirical evaluation.

How Many Samples Are Enough?
[Figure: sampling rate needed to achieve cv = SE/a < 0.5; larger D → smaller sampling rate; a cluster of 10k machines vs. a single machine.]
At web scale (D ≈ 10^10), a sampling rate of 10^-4 may suffice for "ordinary" words.
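The sketch construction and the closed form are easy to check on the running example. A minimal Python sketch (mine, not the talk's): the postings, D, and k come from the Proposed Sketches slide; the estimator is the approximate MLE above, with the margin-free baseline for comparison.

from math import sqrt

P1 = [3, 4, 7, 9, 10, 15, 18, 19, 24, 25, 28, 33]
P2 = [2, 4, 5, 8, 15, 19, 21, 24, 27, 28, 31, 35]
D, k = 36, 7
f1, f2 = len(P1), len(P2)

K1, K2 = P1[:k], P2[:k]               # sketches = fronts of the postings
Ds = min(K1[-1], K2[-1])              # Ds = min(18, 21) = 18
a_s = len({x for x in K1 if x <= Ds} & {x for x in K2 if x <= Ds})  # 2
b_s = sum(x <= Ds for x in K1) - a_s  # 5
c_s = sum(x <= Ds for x in K2) - a_s  # 3
d_s = Ds - a_s - b_s - c_s            # 8

a_MF = D / Ds * a_s                   # margin-free baseline: 4.0

# Approximate MLE (closed form): the smaller root of the quadratic
# (2a_s+b_s+c_s) a^2 - [f1(2a_s+c_s) + f2(2a_s+b_s)] a + 2 a_s f1 f2 = 0
B = f1 * (2*a_s + c_s) + f2 * (2*a_s + b_s)
a_MLE = (B - sqrt(B**2 - 8*f1*f2*a_s*(2*a_s + b_s + c_s))) / (2*(2*a_s + b_s + c_s))

print(f"a_s={a_s} b_s={b_s} c_s={c_s} d_s={d_s}  a_MF={a_MF:.2f}  a_MLE={a_MLE:.2f}")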
Broder's Sketch: Original & Minwise
Estimate resemblance (R).
• Notation:
  – Words: w1, w2
  – Postings: P1, P2 (sets of doc IDs)
  – Resemblance: $R = \frac{|P_1 \cap P_2|}{|P_1 \cup P_2|} = \frac{a}{f_1 + f_2 - a}$
  – Random permutation: π
• Minwise sketch:
  – Permute the doc IDs k times: π_1, …, π_k
  – For each π_i, let min_i(P) be the smallest doc ID in π_i(P)
  – $\hat{R} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}\{\min_i(P_1) = \min_i(P_2)\}$
• Original sketch:
  – Sketches K_1, K_2: sets of doc IDs (fronts of postings)
  – Permute the doc IDs once: π; let first_k(P) be the first k doc IDs in π(P)
  – $K_1 = \mathrm{first}_k(P_1)$, $K_2 = \mathrm{first}_k(P_2)$
  – $\hat{a}_s = |\mathrm{first}_k(K_1 \cup K_2) \cap K_1 \cap K_2|$ (throws out half)

Multi-way Associations: Evaluation
[Figure: MSE relative improvement over the margin-free baseline.]
• When we know the margins, we ought to use them.
• Gains are larger for 2-way than multi-way.
• Degrees of freedom = 2^m − (m + 1), which increases exponentially with m, suggesting margin constraints become less important as m increases.

Conclusions (1 of 2)
When we know the margins, we ought to use them.
• Estimating contingency tables: a fundamental problem
• Practical apps:
  – Estimating page hits for two or more words (Governator)
  – KnowItAll: estimating mutual information from page hits
• Baselines:
  – Independence: ignores interactions (awful)
  – Margin-free: ignores postings (wasteful): $\hat{a}_{MF} = \frac{D}{D_s} a_s$
  – (≈2x) Broder's sketch (WWW 97, STOC 98, STOC 2002): throws out half the sample
  – (≈10x) Random projections (ACL 2005, STOC 2002)
• Proposed method:
  – Sampling: like Broder's sketch, but throws out less; larger a_s than random sampling
  – Estimation: MLE (maximum likelihood)
    • MF: estimation is easy without margin constraints
    • MLE: find the most likely contingency table, given the observations a_s, b_s, c_s, d_s

Conclusions (2 of 2): Rising Tide of Data Lifts All Boats
If you have a lot of data, then you don't need a lot of methodology.
• Recommended approximation:
  $\hat{a}_{MLE,a} = \frac{f_1(2a_s+c_s) + f_2(2a_s+b_s) - \sqrt{\big(f_1(2a_s+c_s) + f_2(2a_s+b_s)\big)^2 - 8 f_1 f_2 a_s (2a_s+b_s+c_s)}}{2\,(2a_s+b_s+c_s)}$
• Trade-off between
  – computational work (space and time) and
  – statistical accuracy (variance and errors)
• Derived formulas for variance, showing how the trade-off depends on the sampling rate
• At Web scale, a sampling rate (D_s/D) of 10^-4 turns a cluster of 10k machines into a single machine

Backup: Comparison with Broder's Algorithm
Broder's method has larger variance (≈2x) because it uses only half the sketch:
$\frac{Var(\hat{R}_{MLE})}{Var(\hat{R}_B)} \approx \frac{k}{f_1+f_2}\,\max\!\left(\frac{f_1}{k_1}, \frac{f_2}{k_2}\right) = \begin{cases} \frac{\max(f_1, f_2)}{f_1+f_2}, & \text{if } k_1 = k_2 = k \ \text{(equal samples)} \\[4pt] \frac{1}{2}, & \text{if } k_1 = 2k\frac{f_1}{f_1+f_2},\ k_2 = 2k\frac{f_2}{f_1+f_2} \ \text{(proportional samples)} \end{cases}$
when $a \ll \min(f_1, f_2) \le \max(f_1, f_2) \ll D$. In both cases, Var(R̂_MLE) << Var(R̂_B).
[Figures: ratio of variances for equal samples and for proportional samples; Var(R̂_MLE) << Var(R̂_B).]

Comparison with Broder's Algorithm: Estimation of Resemblance
[Figure: Broder's method throws out half the samples; ≈50% improvement. A code sketch of the minwise estimator follows at the end.]

Comparison with Random Projections: Estimation of Angle
[Figures: huge (≈10x) improvement.]
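To make the Broder baseline from the backup slides concrete, here is a minimal minwise-estimator sketch in Python (mine, not from the talk), reusing the running example's postings: with k random permutations, R̂ is the fraction of permutations on which the two postings share the same minimum.

import random

P1 = {3, 4, 7, 9, 10, 15, 18, 19, 24, 25, 28, 33}
P2 = {2, 4, 5, 8, 15, 19, 21, 24, 27, 28, 31, 35}
D, k = 36, 500
rng = random.Random(0)

matches = 0
for _ in range(k):
    pi = list(range(1, D + 1))
    rng.shuffle(pi)                       # one random permutation of doc IDs
    rank = {doc: r for r, doc in enumerate(pi)}
    if min(P1, key=rank.get) == min(P2, key=rank.get):
        matches += 1

R_hat = matches / k
R = len(P1 & P2) / len(P1 | P2)           # true resemblance a/(f1+f2-a) = 5/19
print(f"R_hat = {R_hat:.3f}, true R = {R:.3f}")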