Metasearch
Mathematics of Knowledge and Search Engines: Tutorials @ IPAM, 9/13/2007
Zhenyu (Victor) Liu
Software Engineer, Google Inc.
vicliu@google.com

Roadmap
- The problem
- Database content modeling
- Database selection
- Summary

Metasearch – the problem
(diagram: a query such as "applied mathematics" is issued to a metasearch engine, which forwards it to several underlying databases, initially unknown ("???"), and collects their search results)

Metasearch Engine Subproblems
- Database content modeling: how does a metasearch engine "perceive" the content of each database?
- Database selection: selectively issue the query to the "best" databases
- Query translation: different databases have different query formats, e.g., "a+b" / "a AND b" / "title:a AND body:b" / etc.
- Result merging: the query "applied mathematics" returns top-10 results from both science.com and nature.com — how should they be presented?

Database content modeling and selection: a simplified example
- Keep a "content summary" of each database
- Select based on the # of matching docs, assuming independence between words

    Database A (total #: 10,000)                     Database B (total #: 60,000)
    Word w        # of docs using w   Pr(w)          Word w        # of docs using w   Pr(w)
    applied       4,000               0.4            applied       200                 0.00333
    mathematics   2,500               0.25           mathematics   300                 0.005

- 10,000 × 0.4 × 0.25 = 1,000 documents match "applied mathematics" in database A, versus 60,000 × 0.00333 × 0.005 ≈ 1 document in database B

Roadmap
- The problem
- Database content modeling
- Database selection
- Summary

Database content modeling — four approaches, from most to least storage demanding:
- Replicate the entire text database: most storage demanding; requires a fully cooperative database
- Download part of a text database: more storage demanding; works with a non-cooperative database
- Obtain a full content summary: less storage demanding; requires a fully cooperative database
- Approximate the content summary via sampling: least storage demanding; works with a non-cooperative database

Replicate the entire database
- E.g., www.google.com/patents, a replica of the entire USPTO patent document database

Download a non-cooperative database
- Objective: download as much as possible
- Basic idea: "probing" (querying with short queries such as "applied" or "mathematics" through the search interface) and downloading all results
- Practically, we can only issue a fixed # of probes (e.g., 1,000 queries per day)

Harder than the "set-coverage" problem
- All docs in a database db form the universe; each probe corresponds to a subset
- Find the least # of subsets (probes) that covers db, or the max coverage with a fixed # of subsets (probes)
- NP-complete, assuming all docs are equal
- The greedy algorithm is proved to be the best-possible polynomial-time approximation algorithm
- But here the cardinality of each subset (the # of matching docs for each probe) is unknown!

Pseudo-greedy algorithms [NPC05]
- Greedy set-coverage: choose the subset with the max "cardinality gain"
- When the cardinality of subsets is unknown:
  - Assume the cardinality of subsets is (proportionally) the same across databases, e.g., build a reference database from Web pages crawled from the Internet and rank single words according to their frequency
  - Or start with certain "seed" queries and adaptively choose query words from within the docs returned, so the choice of probing words varies from database to database

An adaptive method
- D(wi): the subset of docs returned by probing with word wi
- With w1, w2, …, wn already issued, choose the next probe word as
    argmax over w_{n+1}, a word used by D(w1) ∪ … ∪ D(wn), of | D(w_{n+1}) \ (D(w1) ∪ … ∪ D(wn)) |
- Rewritten as |db|·Pr(w_{n+1}) − |db|·Pr(w_{n+1} ∧ (w1 ∨ … ∨ wn)), where Pr(w) is the probability of w appearing in a doc of db
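The cardinality gain above becomes computable once |db| and Pr(w) are estimated: since all probe results are downloaded, the second term is simply the number of already-downloaded docs containing the candidate word. Below is a minimal sketch of this selection step; the helper names (next_probe_word, est_prob, est_db_size) and the toy data are illustrative assumptions, not code from [NPC05], and the Pr(w) estimate is taken as given (see the Zipf-based interpolation on the next slide).

```python
# A minimal sketch (not the [NPC05] implementation) of picking the next probe
# word by its estimated gain |db|*Pr(w) - |db|*Pr(w AND (w1 v ... v wn)).
# Assumed inputs: `downloaded` is the list of docs retrieved so far (each doc a
# set of words), `est_prob(w)` estimates Pr(w) over the whole db, and
# `est_db_size` estimates |db|.

def next_probe_word(downloaded, issued, est_prob, est_db_size):
    """Choose the unissued word with the largest estimated coverage gain."""
    candidates = set().union(*downloaded) - set(issued)  # words seen in downloaded docs
    def gain(w):
        # Docs containing w AND some already-issued word are exactly the
        # already-downloaded docs that contain w (all probe results were kept).
        already_covered = sum(1 for doc in downloaded if w in doc)
        return est_db_size * est_prob(w) - already_covered
    return max(candidates, key=gain)

# Toy usage with made-up numbers:
docs = [{"applied", "mathematics", "model"}, {"applied", "theory"}]
print(next_probe_word(docs, issued=["applied"],
                      est_prob=lambda w: 0.01, est_db_size=10_000))
```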
An adaptive method (cont'd)
- How to estimate P̃r(w_{n+1})?
- Zipf's law: Pr(w) = α·(R(w) + β)^(−γ), where R(w) is the rank of w in descending order of Pr(w)
- Assume the ranking of w1, w2, …, wn relative to the other words is the same in the downloaded subset as in db
- Interpolate: rank the single words by Pr(w) in the downloaded documents, fit a Zipf's-law curve to the known Pr(w) values of w1, w2, …, wn, and read the interpolated P̃r(w) of any other word off the fitted curve

Obtain an exact content summary C(db) for a database db
- Statistics about words in db, e.g., df (document frequency):
    w             df
    applied       4,000
    mathematics   2,500
    research      1,000
- Standards and proposals for cooperative databases to follow when exporting C(db):
  - STARTS [GCM97]: initiated by Stanford; had attracted the main search engine players by 1997: Fulcrum, Infoseek, PLS, Verity, WAIS, Excite, etc.
  - SDARTS [GIG01]: initiated by Columbia U.

Approximate the content summary
- Objective: C̃(db) of a database db, with high vocabulary coverage & high accuracy
- Basic idea: probing and downloading sample docs [CC01]
- Example, with df as the content-summary statistic:
  1. Pick a single word as the query, probe the database
  2. Download a fraction of the results, e.g., the top-k
  3. If the terminating condition is unsatisfied, go to 1
  4. Output <w, d̃f> based on the sample docs downloaded

Vocabulary coverage
- Can a small sample of docs cover the vocabulary of a big database?
- Yes, based on Heaps' law [Hea78]: |W| = K·n^β, where
  - n: # of words scanned
  - W: set of distinct words encountered
  - K: a constant, typically in [10, 100]
  - β: a constant, typically in [0.4, 0.6]
- Empirically verified [CC01]

Estimate document frequency
- How to identify the d̃f of a word in the entire database?
  - For a word w used as a query during sampling: its df is typically revealed in the search results
  - For a word w' that merely appears in the sampled docs: its d̃f must be estimated from the doc sample
- Apply Zipf's law & interpolate [IG02]:
  1. Rank the words w and w' by their frequency in the sample
  2. Curve-fit Zipf's law based on the true df of the query words w
  3. Interpolate the estimated d̃f of each w' onto the fitted curve

What if db changes over time?
- So do its content summary C(db) and C̃(db) [INC05]
- Empirical study: 152 Web databases, a snapshot downloaded weekly, for 1 year
  - df as the statistics measure
  - Kullback-Leibler divergence between the "latest" snapshot and the snapshot from time t ago as the "change" measure
  - (plot: the KL divergence grows with t — db does change!)
- How do we model the change? When should we resample and get a new C̃(db)?

Model the change
- KLdb(t): the KL divergence between the current C̃(db) and the C̃(db, t) of time t ago
- T: the time when KLdb(t) exceeds a pre-specified threshold τ
- Apply principles of Survival Analysis:
  - Survival function: Sdb(t) = 1 − Pr(T ≤ t)
  - Hazard function: hdb(t) = −(dSdb(t)/dt) / Sdb(t)
- How to compute hdb(t), and then Sdb(t)?
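As a concrete reading of the change measure above, here is a small sketch (mine, not code from [INC05]) that computes the KL divergence between the word distributions induced by two df-based content summaries and tests it against a threshold τ. The dictionary representation and the smoothing constant are assumptions, introduced so the divergence stays finite when a word is missing from one snapshot.

```python
# KL divergence between the "current" content summary and an older snapshot,
# with df counts normalized into word probabilities.

from math import log

def kl_divergence(df_now, df_old, smooth=1e-6):
    """KL( P_now || P_old ) over the union vocabulary of the two summaries."""
    vocab = set(df_now) | set(df_old)
    total_now = sum(df_now.values()) + smooth * len(vocab)
    total_old = sum(df_old.values()) + smooth * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (df_now.get(w, 0) + smooth) / total_now
        q = (df_old.get(w, 0) + smooth) / total_old
        kl += p * log(p / q)
    return kl

def needs_resampling(df_now, df_old, tau):
    """The event T of the slide above: KL_db(t) has exceeded the threshold tau."""
    return kl_divergence(df_now, df_old) > tau

# Toy usage:
old = {"applied": 4000, "mathematics": 2500, "research": 1000}
new = {"applied": 4200, "mathematics": 2400, "research": 900, "biology": 300}
print(kl_divergence(new, old), needs_resampling(new, old, tau=0.01))
```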
Learn the hdb(t) of database change
- Cox proportional-hazards regression model: ln(hdb(t)) = ln(hbase(t)) + β1·x1 + …, where each xi is a predictor variable
- Predictors:
  - The pre-specified threshold τ
  - The Web domain of db — ".com", ".edu", ".gov", ".org", "others" — encoded as 5 binary "domain variables"
  - ln(|db|)
  - avg KLdb(1 week) measured in the training period
  - …

Train the Cox model
- A stratified Cox model is applied:
  - The domain variables didn't satisfy the Cox proportional-hazards assumption
  - So stratify on each domain, i.e., fit a separate hbase(t) / Sbase(t) per domain
- Train Sbase(t) for each domain, assuming a Weibull distribution: Sbase(t) = e^(−λ·t^γ)

Training result
- γ ranges in (0.57, 1.08), so Sbase(t) is not an exponential distribution
- (plot: the fitted Sbase(t) curves versus t)

Training result (cont'd)
    predictor             β value
    ln(|db|)              0.094
    avg KLdb(1 week)      6.762
    τ                     -1.305
- A larger db takes less time for KLdb(t) to exceed τ
- Databases that change faster during a short period are more likely to change later on

How to use the trained model?
- The model gives Sdb(t), the likelihood that db "has not changed much" by time t
- We want an update policy to periodically resample each db
  - Intuitively, maximize ∑db Sdb(t)
  - More precisely, maximize the time average S̄ = lim_{t→∞} (1/t) ∫₀ᵗ [ ∑db Sdb(t') ] dt'
- A policy: {fdb}, where fdb is the update frequency of db, e.g., 2/week
- Subject to practical constraints, e.g., a total update cap per week

Derive an optimal update policy
- Find the {fdb} that maximizes S̄ under the constraint ∑db fdb = F, where F is a global frequency limit
- Solvable by the Lagrange-multiplier method
- Sample results:
    db                  λ       F = 4/week   F = 15/week
    tomshardware.com    0.088   1/46         1/5
    usps.com            0.023   1/34         1/12

Roadmap
- The problem
- Database content modeling
- Database selection
- Summary

Database selection
- Select the databases to which a given query should be issued
  - Necessary when the metasearch engine does not have an entire replica of each database — most likely it has a content summary only
  - Reduces the query load on the entire system
- Formalization: given a query q = <w1, …, wm> and databases db1, …, dbn, rank the databases according to their "relevancy score" r(dbi, q) for query q

Relevancy score
- # of matching docs in db
- Similarity between q and the top docs returned by db
  - Typically the vector-space similarity (dot product) between q and a doc
  - Sum / avg of the similarities of the top-k docs of each db, e.g., top-10
  - Sum / avg of the similarities of the top docs of each db exceeding a similarity threshold
- Relevancy of db as judged by users
  - Explicit relevance feedback
  - User click-behavior data

Estimating r(db, q)
- Typically, the true r(db, q) is unavailable
- Estimate r̃(db, q) based on C(db) or C̃(db)

Estimating r(db, q), example 1 [GGT99]
- r(db, q): the # of matching docs in db
- Independence assumption: the query words w1, …, wm appear independently in db
- r̃(db, q) = |db| × ∏_{wj ∈ q} df(db, wj) / |db|
- df(db, wj): the document frequency of wj in db — could be the d̃f(db, wj) from C̃(db)

Estimating r(db, q), example 2 [GGT99]
- r(db, q) = ∑_{d ∈ db, sim(d, q) > l} sim(d, q)
  - d: a doc in db
  - sim(d, q): the vector dot product between d and q, with each word in d and q weighted by the common tf·idf weighting
  - l: a pre-specified threshold

Estimating r(db, q), example 2 (cont'd)
- Content summary C(db) required:
  - df(db, w): doc frequency
  - v̄(db, w): ∑_{d ∈ db} (weight of w in d's vector)
  - <v̄(db, w1), v̄(db, w2), …>: the "centroid" of the entire db viewed as a "cluster of doc vectors"

Estimating r(db, q), example 2 (cont'd)
- For l = 0, r(db, q) is the sum of all q-doc similarity values of db:
  - r(db, q) = ∑_{d ∈ db} sim(d, q)
  - r̃(db, q) = r(db, q) = <v(q, w1), v(q, w2), …> · <v̄(db, w1), v̄(db, w2), …>
  - v(q, w): the weight of w in the query vector
- What about l > 0?
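Before turning to the l > 0 scenarios, a compact sketch of the two estimators just described: the independence-based matching-doc count of example 1 and the l = 0 centroid dot product of example 2. This is a minimal illustration assuming dictionary-based summaries; the function and variable names are mine, not from [GGT99].

```python
# Two r~(db, q) estimators over a simple dictionary-based content summary.

def estimate_matching_docs(db_size, df, query_words):
    """Example 1: r~(db, q) = |db| * prod_j df(db, wj) / |db|, assuming the
    query words appear independently in db."""
    r = float(db_size)
    for w in query_words:
        r *= df.get(w, 0) / db_size
    return r

def estimate_sum_similarity(centroid, query_weights):
    """Example 2 with l = 0: the dot product of the query vector
    <v(q, w1), v(q, w2), ...> with the db centroid <v(db, w1), v(db, w2), ...>."""
    return sum(qw * centroid.get(w, 0.0) for w, qw in query_weights.items())

# Toy usage, reusing the numbers of the earlier simplified example:
df_a = {"applied": 4000, "mathematics": 2500}
print(estimate_matching_docs(10_000, df_a, ["applied", "mathematics"]))  # 1000.0
print(estimate_sum_similarity({"applied": 120.0, "mathematics": 80.5},
                              {"applied": 1.0, "mathematics": 0.5}))     # 160.25
```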
Estimating r(db, q), example 2 (cont'd)
- For l > 0, assume a uniform weight of w among all docs using w, i.e., the weight of w in any doc = v̄(db, w) / df(db, w)
- Highly-correlated query words scenario: if df(db, wi) < df(db, wj), every doc using wi also uses wj
  - Sort the words in q so that df(db, w1) ≤ df(db, w2) ≤ … ≤ df(db, wm)
  - r̃(db, q) = ∑_{i=1…p} v(q, wi)·v̄(db, wi) + df(db, wp)·[ ∑_{j=p+1…m} v(q, wj)·v̄(db, wj) / df(db, wj) ], where p is determined by some criteria [GGT99]
- Disjoint query words scenario: no doc using wi uses wj
  - r̃(db, q) = ∑_{i=1…m, df(db, wi) > 0 ∧ v(q, wi)·v̄(db, wi)/df(db, wi) > l} v(q, wi)·v̄(db, wi)

Estimating r(db, q), example 2 (cont'd)
- The ranking of databases based on r̃(db, q) has been empirically evaluated [GGT99]

A probabilistic model for errors in estimation [LLC04]
- Any estimation makes errors
- An (observed) error distribution for each db; the distribution of db1 ≠ the distribution of db2
- Definition of the (relative) error: err(db, q) = ( r(db, q) − r̃(db, q) ) / r̃(db, q)

Modeling the errors: a motivating experiment
- dbPMC: PubMed Central, www.pubmedcentral.nih.gov
- Two healthcare-related query sets, Q1 and Q2, with |Q1| = |Q2| = 1000 and Q1 ∩ Q2 = ∅
- Compute err(dbPMC, q) for each sample query q ∈ Q1 or Q2
- (plots: the error probability distribution of err(dbPMC, q) over Q1 closely resembles that over Q2)
- Further verified through statistical tests (Pearson χ²)

Implications of the experiment
- On a text database, sample queries show similar error behavior
- So we can sample a database and summarize the error behavior into an Error Distribution (ED)
- Use the ED to predict the error for a future, unseen query
- Sampling-size study [LLC04]: a few hundred sample queries are good enough

From an Error Distribution (ED) to a Relevancy Distribution (RD)
- Database db1, query qnew; by definition, r(db1, qnew) = (err(db1, qnew) + 1) × r̃(db1, qnew)
- ① The ED for db1, from sampling: err(db1, qnew) = −50% with prob. 0.4, 0% with prob. 0.5, +50% with prob. 0.1
- ② An existing estimation method gives r̃(db1, qnew) = 1000
- ③ ④ Combining the two yields a Relevancy Distribution (RD): r(db1, qnew) = 500 with prob. 0.4, 1000 with prob. 0.5, 1500 with prob. 0.1

RD-based selection
- r̃(db1, qnew) = 1000 > r̃(db2, qnew) = 650, so estimation-based selection ranks db1 > db2
- db1's RD (from its ED over −50% / 0% / +50%): 500 with prob. 0.4, 1000 with prob. 0.5, 1500 with prob. 0.1
- db2's RD (from its ED over 0% / +100%): 650 with prob. 0.1, 1300 with prob. 0.9
- RD-based selection ranks db1 < db2, since Pr(db1 < db2) = 0.85
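A small sketch of the ED-to-RD construction and the pairwise comparison above, with discrete distributions stored as value-to-probability dictionaries. It reproduces the slide example (Pr(db1 < db2) = 0.85); the function names are mine, not from [LLC04].

```python
# ED -> RD via r(db, q) = (err(db, q) + 1) * r~(db, q), then a pairwise
# comparison of two independent discrete relevancy distributions.

def relevancy_distribution(error_dist, r_estimate):
    """Turn an Error Distribution into a Relevancy Distribution."""
    return {(err + 1.0) * r_estimate: p for err, p in error_dist.items()}

def prob_greater(rd_a, rd_b):
    """Pr( r_a > r_b ) for two independent discrete relevancy distributions."""
    return sum(pa * pb
               for ra, pa in rd_a.items()
               for rb, pb in rd_b.items()
               if ra > rb)

ed_db1 = {-0.5: 0.4, 0.0: 0.5, 0.5: 0.1}        # ED of db1, from sampling
ed_db2 = {0.0: 0.1, 1.0: 0.9}                   # ED of db2
rd_db1 = relevancy_distribution(ed_db1, 1000)   # {500: 0.4, 1000: 0.5, 1500: 0.1}
rd_db2 = relevancy_distribution(ed_db2, 650)    # {650: 0.1, 1300: 0.9}
print(prob_greater(rd_db2, rd_db1))             # 0.85 -> RD-based selection picks db2
```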
Correctness metric
- Terminology: DBk — the k databases returned by some selection method; DBtopk — the actual top-k answer
- How correct is DBk compared to DBtopk?
  - Absolute correctness: Cora(DBk) = 1 if DBk = DBtopk, and 0 otherwise
  - Partial correctness: Corp(DBk) = |DBk ∩ DBtopk| / k
  - Cora(DBk) = Corp(DBk) for k = 1

Effectiveness of RD-based selection
- 20 healthcare-related text databases on the Web
- Q1 (training, 1000 queries) to learn the ED of each database
- Q2 (testing, 1000 queries) to test the correctness of database selection

                                              k=1                      k=3
                                              Avg(Cora) = Avg(Corp)    Avg(Cora)        Avg(Corp)
    Estimation-based selection
    (term-independence estimator)             0.471                    0.301            0.699
    RD-based selection                        0.651 (+38.2%)           0.478 (+58.8%)   0.815 (+30.9%)

Probing to improve correctness
- With RD-based selection (k = 1): 0.85 = Pr(db2 > db1) = Pr({db2} = DBtop1) = 1·Pr({db2} = DBtop1) + 0·Pr({db2} ≠ DBtop1) = E[Cora({db2})]
- Probing dbi: contact dbi to obtain its exact relevancy
- After probing db1 and observing r(db1, q) = 500: E[Cora({db2})] = Pr(db2 > db1) = 1

Computing the expected correctness
- Expected absolute correctness: E[Cora(DBk)] = 1·Pr(Cora(DBk) = 1) + 0·Pr(Cora(DBk) = 0) = Pr(Cora(DBk) = 1) = Pr(DBk = DBtopk)
- Expected partial correctness: E[Corp(DBk)] = ∑_{0 ≤ l ≤ k} (l/k)·Pr(Corp(DBk) = l/k) = ∑_{0 ≤ l ≤ k} (l/k)·Pr(|DBk ∩ DBtopk| = l)

Adaptive probing algorithm: APro
- User-specified correctness threshold: t
- Maintain the RDs of the probed and unprobed databases db1, …, dbn
- Loop: is there any DBk with E[Cor(DBk)] ≥ t?
  - YES: return this DBk
  - NO: probe one more database and repeat

Which database to probe?
- A greedy strategy (a simplified sketch follows the effectiveness results below):
  - The stopping condition is E[Cor(DBk)] ≥ t; once probed, which database leads to the highest E[Cor(DBk)]?
  - E.g., suppose we will probe db3: if r(db3, q) = ra, max E[Cor(DBk)] = 0.85; if r(db3, q) = rb, max E[Cor(DBk)] = 0.8; if r(db3, q) = rc, max E[Cor(DBk)] = 0.9
  - Probe the database that leads to the largest "expected" max E[Cor(DBk)]

Effectiveness of adaptive probing
- 20 healthcare-related text databases on the Web
- Q1 (training, 1000 queries) to learn the RD of each database; Q2 (testing, 1000 queries) to test the correctness of database selection
- (plots: avg Cora for k=1, avg Cora for k=3, and avg Corp for k=3, each versus the # of databases probed (0–5), comparing adaptive probing (APro) against the term-independence estimator)
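The adaptive probing loop can be illustrated with a short sketch. The version below simplifies APro: it handles only k = 1, evaluates E[Cora] by Monte Carlo sampling from the RDs, and probes databases in decreasing order of estimated relevancy rather than with the greedy look-ahead of the "which database to probe?" slide. The probe_fn callback, the sample count, and the toy data are illustrative assumptions, not details of [LLC04].

```python
# A simplified APro-style loop for k = 1 over discrete relevancy distributions.

import random

def expected_value(rd):
    return sum(v * p for v, p in rd.items())

def sample(rd):
    values, probs = zip(*rd.items())
    return random.choices(values, weights=probs, k=1)[0]

def expected_cor_a_top1(rds, n_samples=20_000):
    """Return (best_db, E[Cor_a({best_db})]): the database most likely to be
    the true top-1, and the probability that it actually is."""
    dbs = list(rds)
    wins = {db: 0 for db in dbs}
    for _ in range(n_samples):
        draw = {db: sample(rds[db]) for db in dbs}
        wins[max(dbs, key=lambda db: draw[db])] += 1
    best = max(dbs, key=lambda db: wins[db])
    return best, wins[best] / n_samples

def apro_top1(rds, probe_fn, threshold):
    """Probe until some singleton {db} reaches E[Cor_a] >= threshold; probing a
    database collapses its RD to its exact relevancy."""
    rds = dict(rds)
    to_probe = sorted(rds, key=lambda db: -expected_value(rds[db]))
    while True:
        best, e_cor = expected_cor_a_top1(rds)
        if e_cor >= threshold or not to_probe:
            return best, e_cor
        db = to_probe.pop(0)
        rds[db] = {probe_fn(db): 1.0}   # exact relevancy after the probe

# Toy usage with the RDs of the RD-based-selection example:
rds = {"db1": {500: 0.4, 1000: 0.5, 1500: 0.1}, "db2": {650: 0.1, 1300: 0.9}}
print(apro_top1(rds, probe_fn=lambda db: 500 if db == "db1" else 1300,
                threshold=0.95))
```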
The "lazy TA problem"
- The same problem, generalized & "humanized"
- After the final exam, the TA wants to find the top-scoring students, but is "lazy" and doesn't want to score all the exam sheets
- Input: every student's score as a known distribution, observed from previous quizzes and mid-term exams
- Output: a scoring strategy that maximizes the correctness of the "guessed" top-k students

Further study of this problem [LSC05]
- Proves that greedy probing is optimal in special cases
- More interesting factors to be explored:
  - An "optimal" probing strategy in general cases
  - Non-uniform probing costs
  - Time-variant distributions

Roadmap
- The problem
- Database content modeling
- Database selection
- Summary

Summary
- Metasearch: a challenging problem
- Database content modeling
  - Sampling enhanced by proper application of Zipf's law and Heaps' law
  - Content change modeled using Survival Analysis
- Database selection
  - Estimation of database relevancy based on assumptions
  - A probabilistic framework that models the estimation error as a distribution
  - An "optimal" probing strategy for a collection of distributions as input

References
[CC01] J.P. Callan and M. Connell, "Query-Based Sampling of Text Databases," ACM Transactions on Information Systems, 19(2), 2001.
[GCM97] L. Gravano, C-C. K. Chang, H. Garcia-Molina, A. Paepcke, "STARTS: Stanford Proposal for Internet Meta-searching," in Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, 1997.
[GGT99] L. Gravano, H. Garcia-Molina, A. Tomasic, "GlOSS: Text-Source Discovery over the Internet," ACM Transactions on Database Systems, 24(2), 1999.
[GIG01] N. Green, P. Ipeirotis, L. Gravano, "SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching," in Proc. of the Joint Conf. on Digital Libraries (JCDL), 2001.
[Hea78] H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, 1978.
[IG02] P. Ipeirotis, L. Gravano, "Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection," in Proc. of the 28th VLDB Conf., 2002.
[INC05] P. Ipeirotis, A. Ntoulas, J. Cho, L. Gravano, "Modeling and Managing Content Changes in Text Databases," in Proc. of the 21st IEEE Int'l Conf. on Data Engineering (ICDE), 2005.
[LLC04] Z. Liu, C. Luo, J. Cho, W.W. Chu, "A Probabilistic Approach to Metasearching with Adaptive Probing," in Proc. of the 20th IEEE Int'l Conf. on Data Engineering (ICDE), 2004.
[LSC05] Z. Liu, K.C. Sia, J. Cho, "Cost-Efficient Processing of Min/Max Queries over Distributed Sensors with Uncertainty," in Proc. of the ACM Symposium on Applied Computing, 2005.
[NPC05] A. Ntoulas, P. Zerfos, J. Cho, "Downloading Hidden Web Content," in Proc. of the Joint Conf. on Digital Libraries (JCDL), June 2005.