Using Sketches to Estimate Associations
Ping Li (Cornell), Kenneth Church (Microsoft)
DIMACS, 3/27/2009

Sample contingency table        Original contingency table
        W2    ~W2                       W2    ~W2
 W1     a_s   b_s                W1     a     b
~W1     c_s   d_s               ~W1     c     d

On Delivering Embarrassingly Distributed Cloud Services
Hotnets-2008
Ken Church, Albert Greenberg, James Hamilton
{church, albert, jamesrh}@microsoft.com
[Figure: $1B vs. $2M]

Containers: Disruptive Technology
• Implications for Shipping: new ships, ports, unions
• Implications for Hotnets:
  – New data center designs
  – Power/networking trade-offs
  – Cost models: expense vs. capital
  – Apps: embarrassingly distributed (a restriction on embarrassingly parallel)
  – Machine models: distributed parallel cluster vs. parallel cluster

Mega vs. Micro Data Centers
POPs       Cores/POP   Hardware/POP     Co-located With/Near
1          1,000,000   1000 containers  Mega Data Center
10         100,000     100 containers
100        10,000      10 containers    Fiber Hotel/Power Substation
1,000      1,000       1 container
10,000     100         1 rack           Central Office
100,000    10          1 mini-tower     P2P
1,000,000  1           embedded

Related Work
• http://en.wikipedia.org/wiki/Data_center
  – "A data center can occupy one room of a building…"
  – "Servers differ greatly in size from 1U servers to large … silos"
  – "Very large data centers may use shipping containers…" [2]
• 220 containers in one PoP vs. 220 containers in 220 PoPs

Embarrassingly Distributed Probes
• W1 & W2 are shipping containers
• Lots of bandwidth within a container, but less across containers
• Limited bandwidth → sampling (sample vs. original contingency tables, as above)

Page Hits ≈ 1000x BNC Freqs (≈ 1990)
• Strong: 427M (Google)
• Powerful: 353M (Google)

Turney (and KnowItAll)
• PMI + The Web: better together

"It never pays to think until you've run out of data" – Eric Brill
• Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001)
• Moore's Law constant: data collection rates vs. improvement rates
• "More data is better data! Fire everybody and spend the money on data" (quoted out of context; no consistently best learner)

Page Hits Estimates by MSN and Google (August 2005)
Larger corpora → larger counts → more signal.
Query         Hits (MSN)     Hits (Google)
A             2,452,759,266  3,160,000,000   (more frequent)
The           2,304,929,841  3,360,000,000
Kalevala      159,937        214,000
Griseofulvin  105,326        149,000
Saccade       38,202         147,000         (less frequent)
# of (English) documents D ≈ 10^10. Lots of hits even for very rare words.

Caution: Estimates ≠ Actuals
Query                              Hits (MSN)   Hits (Google)
America                            150,731,182  393,000,000
America & China                    15,240,116   66,000,000
America & China & Britain          235,111      6,090,000
America & China & Britain & Japan  154,444      23,300,000
These are just (quick-and-dirty) estimates (not actuals). Joint frequencies ought to decrease monotonically as we add more terms to the query, yet the Google counts for the 3-way and 4-way queries do not. (A code sketch of the rule of thumb follows.)
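To make the rule of thumb concrete before the next slide stresses it: under independence across D documents, the expected joint count is hits(x) · hits(y) / D. A minimal Python sketch (mine, not the talk's); the counts are the Google hits from the Governator slide below, and D ≈ 10^10 is the deck's ballpark for the number of English documents:

# Independence "rule of thumb": with D documents total, if words occurred
# independently, the expected joint count would be D * prod(hits_i / D).
D = 10**10

hits = {
    "Governor": 37_300_000,        # Google counts from the Governator slide
    "Schwarzenegger": 4_030_000,
}

def rule_of_thumb(words, hits=hits, D=D):
    """Expected joint hits under independence: D * prod(hits[w] / D)."""
    expected = D
    for w in words:
        expected *= hits[w] / D
    return expected

est = rule_of_thumb(["Governor", "Schwarzenegger"])
print(f"rule of thumb: {est:,.0f}")    # ~15,000
print("observed (Google): 1,220,000")  # ~80x larger: a strong interaction

The observed joint count is roughly 80x the independence estimate, exactly the kind of strong interaction the next slide is about.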
Query Planning
The rule of thumb breaks down when there are strong interactions (common for the cases of most interest, e.g., the Governator):
Query                                              Hits (Google)
One-way:
  Austria                                          88,200,000
  Governor                                         37,300,000
  Schwarzenegger                                   4,030,000
  Terminator                                       3,480,000
Two-way:
  Governor & Schwarzenegger                        1,220,000
  Governor & Austria                               708,000
  Schwarzenegger & Terminator                      504,000
  Terminator & Austria                             171,000
  Governor & Terminator                            132,000
  Schwarzenegger & Austria                         120,000
Three-way:
  Governor & Schwarzenegger & Terminator           75,100
  Governor & Schwarzenegger & Austria              46,100
  Schwarzenegger & Terminator & Austria            16,000
  Governor & Terminator & Austria                  11,500
Four-way:
  Governor & Schwarzenegger & Terminator & Austria 6,930

Associations: PMI, MI, Cos, R, Cor…
Summaries of the contingency table:
        W2    ~W2
 W1     a     b      f_1 = a + b
~W1     c     d
f_2 = a + c          D = a + b + c + d
• a: # of documents that contain both word W1 and word W2
• b: # of documents that contain word W1 but not word W2
• Margins (aka doc freq): f_1, f_2
• To compute the table (& summaries): 4 parameters (a, b, c, d), 3 constraints (f_1, f_2, D)
• Need just one more constraint

Margins (and More) from Postings (Postings aka Inverted File)
Postings(w): a sorted list of doc IDs for w.
PIG → 13, 25, 33
  Doc #13: "This pig is so cute …"
  Doc #25: "… saw a flying pig …"
  Doc #33: "… was raining pigs and eggs …"
Assume doc IDs are random.

Conventional Random Sampling (Over Documents)
Sample contingency table (a_s, b_s, c_s, d_s) vs. original contingency table (a, b, c, d).
Sample size: $D_s = a_s + b_s + c_s + d_s$.
Margin-free baseline: $\hat{a}_{MF} = \frac{D}{D_s} a_s$.

Random Sampling
• Over documents: simple & well understood, but problematic for rare events.
• Over postings: $a_s = a\,(k/f)^2$, where f = |P| (P = postings, aka inverted file; |P| aka doc freq or margin). Undesirable.

Sketches >> Random Samples
• Best: $a_s \approx a\,(k/f)$ (sketches; see the simulation after this slide)
• Undesirable: $a_s = a\,(k/f)^2$ (independent random samples of the postings)
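A small simulation of this claim (mine, not the talk's; the corpus sizes are made up): sampling each posting list independently at rate k/f keeps about a(k/f)^2 of the intersection, while sketches — fronts of the postings under one shared random assignment of doc IDs — keep about a(k/f):

import random

D, f, a, k = 100_000, 2_000, 500, 200   # corpus, |P1| = |P2|, |P1 & P2|, sample size
rng = random.Random(0)

def make_postings():
    docs = rng.sample(range(D), 2 * f - a)      # docs touched by either word
    both, only1, only2 = docs[:a], docs[a:f], docs[f:]
    return set(both + only1), set(both + only2)

trials, rand_hits, sketch_hits = 200, 0, 0
for _ in range(trials):
    P1, P2 = make_postings()
    # (1) independent random samples of size k from each posting list
    S1 = set(rng.sample(sorted(P1), k))
    S2 = set(rng.sample(sorted(P2), k))
    rand_hits += len(S1 & S2)
    # (2) sketches: the first k doc IDs of each posting (IDs are already
    #     random, which plays the role of the shared permutation)
    K1 = set(sorted(P1)[:k])
    K2 = set(sorted(P2)[:k])
    Ds = min(max(K1), max(K2))                  # common prefix of the ID space
    sketch_hits += len({d for d in K1 & K2 if d <= Ds})

print(f"random sampling: {rand_hits/trials:.1f}  (≈ a*(k/f)^2 = {a*(k/f)**2:.1f})")
print(f"sketches:        {sketch_hits/trials:.1f} (≈ a*(k/f)   = {a*(k/f):.1f})")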
Outline
• Review random sampling, and introduce a running example
• Sample: sketches
  – A generalization of Broder's original method
  – Advantage: larger a_s than random sampling
  – Disadvantage: estimation is more challenging
• Estimation: maximum likelihood (MLE)
• Evaluation

Random Sampling over Documents
• Doc IDs are random integers between 1 and D = 36.
• [Figure: small circles mark docs containing word W1; small squares mark docs containing word W2.]
• Choose a sample size: D_s = 18. Sampling rate = D_s/D = 50%.
• Construct the sample contingency table:
  a_s = |{4, 15}| = 2, b_s = |{3, 7, 9, 10, 18}| = 5,
  c_s = |{2, 5, 8}| = 3, d_s = |{1, 6, 11, 12, 13, 14, 16, 17}| = 8
• Estimation: $a \approx \frac{D}{D_s} a_s$. But that doesn't take advantage of the margins.

Proposed Sketches
Sketch = front of postings:
P1: 3 4 7 9 10 15 18 | 19 24 25 28 33
P2: 2 4 5 8 15 19 21 | 24 27 28 31 35
• Sketches: K_1 = first 7 doc IDs of P1 = {3, 4, 7, 9, 10, 15, 18}; K_2 = first 7 of P2 = {2, 4, 5, 8, 15, 19, 21}
• Choose the sample size: D_s = min(18, 21) = 18; throw out doc IDs beyond D_s (here 19 and 21, shown in red on the slide)
• a_s = |{4, 15}| = 2
• b_s = 7 − a_s = 5
• c_s = 5 − a_s = 3
• d_s = D_s − a_s − b_s − c_s = 8

Estimation: Maximum Likelihood (MLE)
Margin-free baseline: $\hat{a}_{MF} = \frac{D}{D_s} a_s$. When we know the margins, we ought to use them:
$\hat{a}_{MLE} = \arg\max_a P(a_s, b_s, c_s, d_s \mid D_s; a)$
Consider all possible contingency tables (a, b, c, d) and select the one that maximizes the probability of the observations (a_s, b_s, c_s, d_s):
$P(a_s, b_s, c_s, d_s \mid D_s; a) = \binom{a}{a_s}\binom{b}{b_s}\binom{c}{c_s}\binom{d}{d_s} \Big/ \binom{D}{D_s} = \binom{a}{a_s}\binom{f_1-a}{b_s}\binom{f_2-a}{c_s}\binom{D-f_1-f_2+a}{d_s} \Big/ \binom{D}{D_s}$

Exact MLE
First derivative of the log likelihood:
$\frac{\partial \log P(a_s,b_s,c_s,d_s \mid D_s; a)}{\partial a} = \sum_{i=0}^{a_s-1}\frac{1}{a-i} - \sum_{i=0}^{b_s-1}\frac{1}{f_1-a-i} - \sum_{i=0}^{c_s-1}\frac{1}{f_2-a-i} + \sum_{i=0}^{d_s-1}\frac{1}{D-f_1-f_2+a-i}$
Setting $\frac{\partial \log P}{\partial a} = 0$ gives the MLE solution. Problem: too complicated; numerical problems.

Second derivative: $\frac{\partial^2 \log P}{\partial a^2} \le 0$, so the log likelihood is concave and has a unique maximum.
PMF updating formula: $P(a_s,b_s,c_s,d_s \mid D_s; a) = P(a_s,b_s,c_s,d_s \mid D_s; a-1)\, g(a)$, so it suffices to solve g(a) = 1:
$g(a) = \frac{a\,(f_1-a+1-b_s)\,(f_2-a+1-c_s)\,(D-f_1-f_2+a)}{(a-a_s)\,(f_1-a+1)\,(f_2-a+1)\,(D-f_1-f_2+a-d_s)} = 1,$
which reduces to a cubic equation in a.

An Approximate MLE
Suppose we were sampling from the two inverted files directly and independently:
  n_x = a_s + b_s with likelihood P(a_s, b_s; a)
  n_y = a_s + c_s with likelihood P(a_s, c_s; a)
Approximate MLE: maximize P(a_s, b_s, c_s; a) = P(a_s, b_s; a) × P(a_s, c_s; a).

An Approximate MLE: Convenient Closed-Form Solution
$P(a_s, b_s, c_s \mid a) \propto a^{2a_s} (f_x - a)^{b_s} (f_y - a)^{c_s}$
Take the log of both sides and set the derivative to 0:
$\frac{2a_s}{a} - \frac{b_s}{f_x - a} - \frac{c_s}{f_y - a} = 0$
$\hat{a}_{MLE,a} = \frac{f_x(2a_s+c_s) + f_y(2a_s+b_s) - \sqrt{\big(f_x(2a_s+c_s) + f_y(2a_s+b_s)\big)^2 - 8 f_x f_y a_s (2a_s+b_s+c_s)}}{2\,(2a_s+b_s+c_s)}$
• Convenient closed form
• Surprisingly accurate
• Recommended
(A worked implementation on the running example follows the evaluation slides below.)

Evaluation
[Figure: proposed MLE vs. the margin-free and independence baselines; the proposed method is best.] When we know the margins, we ought to use them.

Theoretical Evaluation
• Not surprisingly, there is a trade-off between
  – computational work: space, time
  – statistical accuracy: variance, error
• Formulas state the trade-off precisely in terms of the sampling rate D_s/D.
• Theoretical evaluation: the proposed MLE is better than the margin-free baseline, confirming the empirical evaluation.

How Many Samples Are Enough?
[Figure: sampling rate needed to achieve cv = SE/a < 0.5; larger D → smaller sampling rate; a cluster of 10k machines vs. a single machine.]
At web scale (D ≈ 10^10), a sampling rate of 10^-4 may suffice for "ordinary" words.
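The sketch construction and the closed form are easy to check on the running example. A minimal Python sketch (mine, not the talk's): the postings, D, and k come from the Proposed Sketches slide; the estimator is the approximate MLE above, with the margin-free baseline for comparison.

from math import sqrt

P1 = [3, 4, 7, 9, 10, 15, 18, 19, 24, 25, 28, 33]
P2 = [2, 4, 5, 8, 15, 19, 21, 24, 27, 28, 31, 35]
D, k = 36, 7
f1, f2 = len(P1), len(P2)

K1, K2 = P1[:k], P2[:k]               # sketches = fronts of the postings
Ds = min(K1[-1], K2[-1])              # Ds = min(18, 21) = 18
a_s = len({x for x in K1 if x <= Ds} & {x for x in K2 if x <= Ds})  # 2
b_s = sum(x <= Ds for x in K1) - a_s  # 5
c_s = sum(x <= Ds for x in K2) - a_s  # 3
d_s = Ds - a_s - b_s - c_s            # 8

a_MF = D / Ds * a_s                   # margin-free baseline: 4.0

# Approximate MLE (closed form): the smaller root of the quadratic
# (2a_s+b_s+c_s) a^2 - [f1(2a_s+c_s) + f2(2a_s+b_s)] a + 2 a_s f1 f2 = 0
B = f1 * (2*a_s + c_s) + f2 * (2*a_s + b_s)
a_MLE = (B - sqrt(B**2 - 8*f1*f2*a_s*(2*a_s + b_s + c_s))) / (2*(2*a_s + b_s + c_s))

print(f"a_s={a_s} b_s={b_s} c_s={c_s} d_s={d_s}  a_MF={a_MF:.2f}  a_MLE={a_MLE:.2f}")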
Broder's Sketch: Original & Minwise
Estimate resemblance (R).
• Notation:
  – Words: w1, w2
  – Postings: P1, P2 (sets of doc IDs)
  – Resemblance: $R = \frac{|P_1 \cap P_2|}{|P_1 \cup P_2|} = \frac{a}{f_1 + f_2 - a}$
  – Random permutation: π
• Minwise sketch:
  – Permute the doc IDs k times: π_1, …, π_k
  – For each π_i, let min_i(P) be the smallest doc ID in π_i(P)
  – $\hat{R} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}\{\min_i(P_1) = \min_i(P_2)\}$
• Original sketch:
  – Sketches K_1, K_2: sets of doc IDs (fronts of postings)
  – Permute the doc IDs once: π; let first_k(P) be the first k doc IDs in π(P)
  – $K_1 = \mathrm{first}_k(P_1)$, $K_2 = \mathrm{first}_k(P_2)$
  – $\hat{a}_s = |\mathrm{first}_k(K_1 \cup K_2) \cap K_1 \cap K_2|$ (throws out half)

Multi-way Associations: Evaluation
[Figure: MSE relative improvement over the margin-free baseline.]
• When we know the margins, we ought to use them.
• Gains are larger for 2-way than multi-way.
• Degrees of freedom = 2^m − (m + 1), which increases exponentially with m, suggesting margin constraints become less important as m increases.

Conclusions (1 of 2)
When we know the margins, we ought to use them.
• Estimating contingency tables: a fundamental problem
• Practical apps:
  – Estimating page hits for two or more words (Governator)
  – KnowItAll: estimating mutual information from page hits
• Baselines:
  – Independence: ignores interactions (awful)
  – Margin-free: ignores postings (wasteful): $\hat{a}_{MF} = \frac{D}{D_s} a_s$
  – (≈2x) Broder's sketch (WWW 97, STOC 98, STOC 2002): throws out half the sample
  – (≈10x) Random projections (ACL 2005, STOC 2002)
• Proposed method:
  – Sampling: like Broder's sketch, but throws out less; larger a_s than random sampling
  – Estimation: MLE (maximum likelihood)
    • MF: estimation is easy without margin constraints
    • MLE: find the most likely contingency table, given the observations a_s, b_s, c_s, d_s

Conclusions (2 of 2): Rising Tide of Data Lifts All Boats
If you have a lot of data, then you don't need a lot of methodology.
• Recommended approximation:
  $\hat{a}_{MLE,a} = \frac{f_1(2a_s+c_s) + f_2(2a_s+b_s) - \sqrt{\big(f_1(2a_s+c_s) + f_2(2a_s+b_s)\big)^2 - 8 f_1 f_2 a_s (2a_s+b_s+c_s)}}{2\,(2a_s+b_s+c_s)}$
• Trade-off between
  – computational work (space and time) and
  – statistical accuracy (variance and errors)
• Derived formulas for variance, showing how the trade-off depends on the sampling rate
• At Web scale, a sampling rate (D_s/D) of 10^-4 turns a cluster of 10k machines into a single machine

Backup: Comparison with Broder's Algorithm
Broder's method has larger variance (≈2x) because it uses only half the sketch:
$\frac{Var(\hat{R}_{MLE})}{Var(\hat{R}_B)} \approx \frac{k}{f_1+f_2}\,\max\!\left(\frac{f_1}{k_1}, \frac{f_2}{k_2}\right) = \begin{cases} \frac{\max(f_1, f_2)}{f_1+f_2}, & \text{if } k_1 = k_2 = k \ \text{(equal samples)} \\[4pt] \frac{1}{2}, & \text{if } k_1 = 2k\frac{f_1}{f_1+f_2},\ k_2 = 2k\frac{f_2}{f_1+f_2} \ \text{(proportional samples)} \end{cases}$
when $a \ll \min(f_1, f_2) \le \max(f_1, f_2) \ll D$. In both cases, Var(R̂_MLE) << Var(R̂_B).
[Figures: ratio of variances for equal samples and for proportional samples; Var(R̂_MLE) << Var(R̂_B).]

Comparison with Broder's Algorithm: Estimation of Resemblance
[Figure: Broder's method throws out half the samples; ≈50% improvement. A code sketch of the minwise estimator follows at the end.]

Comparison with Random Projections: Estimation of Angle
[Figures: huge (≈10x) improvement.]
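To make the Broder baseline from the backup slides concrete, here is a minimal minwise-estimator sketch in Python (mine, not from the talk), reusing the running example's postings: with k random permutations, R̂ is the fraction of permutations on which the two postings share the same minimum.

import random

P1 = {3, 4, 7, 9, 10, 15, 18, 19, 24, 25, 28, 33}
P2 = {2, 4, 5, 8, 15, 19, 21, 24, 27, 28, 31, 35}
D, k = 36, 500
rng = random.Random(0)

matches = 0
for _ in range(k):
    pi = list(range(1, D + 1))
    rng.shuffle(pi)                       # one random permutation of doc IDs
    rank = {doc: r for r, doc in enumerate(pi)}
    if min(P1, key=rank.get) == min(P2, key=rank.get):
        matches += 1

R_hat = matches / k
R = len(P1 & P2) / len(P1 | P2)           # true resemblance a/(f1+f2-a) = 5/19
print(f"R_hat = {R_hat:.3f}, true R = {R:.3f}")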