Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing
Martin Theobald, Ralf Schenkel, Gerhard Weikum
Max-Planck Institute for Informatics, Saarbrücken, Germany
ACM SIGIR '05

An Initial Example
TREC Robust Track '04, hard query no. 363 (Aquaint news corpus): "transportation tunnel disasters"
- Expansion terms from relevance feedback, thesaurus lookups, Google top-10 snippets, etc.
- Term similarities, e.g., Rocchio, Robertson & Sparck-Jones, concept similarities, or other correlation measures
- Example expansion sets with similarity weights:
  transportation (1.0): transit 0.9, highway 0.8, train 0.7, truck 0.6, metro 0.6, "rail car" 0.5, car 0.1, ...
  tunnel (1.0): tube 0.9, underground 0.8, "Mont Blanc" 0.7, ...
  disasters (1.0): catastrophe 0.9, accident 0.8, fire 0.7, flood 0.6, earthquake 0.6, "land slide" 0.5, ...
- Count only the best match per document and expansion set
- Increased retrieval robustness
- Increased efficiency: top-k-style query evaluations; open scans on new terms only on demand; no threshold tuning

Outline
- Computational model & background on top-k algorithms
- Incremental Merge over inverted lists
- Probabilistic candidate pruning
- Phrase matching
- Experiments & Conclusions

Computational Model
- Vector space model with a Cartesian product space D1 x ... x Dm and a data set D ⊆ D1 x ... x Dm
- Precomputed local scores s(ti, d) ∈ Di for all d ∈ D, e.g., tf*idf variants or probabilistic models (Okapi BM25), typically normalized to s(ti, d) ∈ [0, 1]
- Monotone score aggregation aggr over the local scores, e.g., sum, max, product (using sum over log sij), cosine (using the L2 norm)
- Partial-match queries (aka "andish"): non-conjunctive query evaluation; weak local matches can be compensated
- Access model:
  - Disk-resident inverted index over a large text corpus (d1, ..., dn)
  - Inverted lists sorted by decreasing local scores
  - Inexpensive sequential accesses to per-term lists: getNextItem()
  - More expensive random accesses: getItemBy(docid)

No-Random-Access (NRA) Algorithm
[Fagin et al., PODS '01; Balke et al., VLDB '00; Buckley & Lewit, SIGIR '85]

NRA(q, L):
  scan all lists Li (i = 1..m) in parallel      // e.g., round-robin
    <d, s(ti, d)> = Li.getNextItem()
    E(d) = E(d) ∪ {i}
    high_i = s(ti, d)
    worstscore(d) = Σ_{i ∈ E(d)} s(ti, d)
    bestscore(d) = worstscore(d) + Σ_{i ∉ E(d)} high_i
    if worstscore(d) > min-k then
      add d to top-k
      min-k = min { worstscore(d') | d' ∈ top-k }
    else if bestscore(d) > min-k then
      candidates = candidates ∪ {d}
    if max { bestscore(d') | d' ∈ candidates } ≤ min-k then
      return top-k

Example (k = 1), query q = (transportation, tunnel, disaster), inverted lists:
  transport: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, ...
  tunnel:    d64 0.8, d23 0.6, d10 0.6, ...
  disaster:  d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, ...

[worstscore, bestscore] intervals per scan depth:
  Depth 1: d78 [0.9, 2.4], d64 [0.8, 2.4], d10 [0.7, 2.4]
  Depth 2: d78 [1.4, 2.0], d23 [1.4, 1.9], ...
  Depth 3: d10 [2.1, 2.1], d78 [1.4, 2.0], d64 [1.2, 2.0] -> STOP!
At depth 3 no remaining candidate's bestscore exceeds min-k = 2.1, so d10 is returned. A naive join-then-sort, in contrast, scans all lists completely before sorting (runtime in between O(m*n) and O(m*n log n)).
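The NRA loop above can be condensed into a small runnable sketch (our own illustration, not the authors' code; it replays the slide's toy lists and, like the slide, stops at scan depth 3 with d10 as the top-1 result):

```python
# Minimal NRA (No-Random-Access) sketch over the slide's toy lists.
# Each list is sorted by descending local score; only sequential access is used.
lists = {
    "transport": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    "tunnel":    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6)],
    "disaster":  [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],
}

def nra(lists, k=1):
    terms = list(lists)
    high = {t: lists[t][0][1] for t in terms}   # current upper bound per list
    seen = {}                                   # doc -> {term: local score}
    for depth in range(max(len(l) for l in lists.values())):
        for t in terms:                         # one round-robin scan step
            if depth < len(lists[t]):
                d, s = lists[t][depth]
                seen.setdefault(d, {})[t] = s
                high[t] = s                     # valid: lists sorted descending
        # worstscore sums the known scores; bestscore adds high_i for unseen lists
        worst = {d: sum(sc.values()) for d, sc in seen.items()}
        best = {d: worst[d] + sum(high[t] for t in terms if t not in seen[d])
                for d in seen}
        topk = sorted(worst, key=worst.get, reverse=True)[:k]
        min_k = min(worst[d] for d in topk)
        # stop once no candidate outside the top-k can still overtake it
        if all(best[d] <= min_k for d in seen if d not in topk):
            return [(d, round(worst[d], 2)) for d in topk]
    return [(d, round(worst[d], 2)) for d in topk]

print(nra(lists, k=1))
```

Note how d78 leads after the first scan step but d10 overtakes it at depth 3, exactly as in the interval table above. A real engine would additionally bound `high` by 0 once a list is exhausted; the conservative carry-over here only delays termination.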
Dynamic & Self-tuning Query Expansions
- The expanded query top-k(transport, tunnel, ~disaster) treats ~disaster as a virtual index list that incrementally merges the inverted lists of its expansions (disaster, accident, fire, ...)
- Best-match score aggregation combines term similarities and local scores:
  score(d) := Σ_{i=1..m} max { sim(ti, tij) * s(tij, d) | tij ∈ exp(ti) }
- Incrementally merge the inverted lists Li1, ..., Lim' in descending order of local scores
- Dynamically add lists to the set of active expansions exp(ti)
- Only short prefixes of each list are touched; there is no need to open all lists
- Benefits: increased retrieval robustness & fewer topic drifts; increased efficiency through fewer active expansions; no threshold tuning of term similarities in the expansions

Incremental Merge Operator
- Expansion terms ~t = {t1, t2, t3}, e.g., from relevance feedback or thesaurus lookups
- Expansion similarities from correlation measures and large-corpus statistics: sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5
- Index-list metadata (e.g., histograms) provides the initial high-scores
- Incremental Merge is iteratively triggered by the top-k operator through sequential getNextItem() accesses
- Example:
  L1 (sim 1.0): d78 0.9, d23 0.8, d10 0.8, d1 0.4, d88 0.3, ...
  L2 (sim 0.9): d64 0.8, d23 0.8, d10 0.7, d12 0.2, d78 0.1, ...   (weighted: 0.72, 0.72, 0.63, 0.18, 0.09)
  L3 (sim 0.5): d11 0.9, d78 0.9, d64 0.7, d99 0.7, d34 0.6, ...   (weighted: 0.45, 0.45, 0.35, 0.35, 0.3)
  Merged ~t: d78 0.9, d23 0.8, d10 0.8, d64 0.72, d23 0.72, d10 0.63, d11 0.45, d78 0.45, d1 0.4, ...
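The Incremental Merge operator can be sketched as a lazy, heap-based merge that weights every list entry by its term similarity and yields matches in descending weighted-score order (a minimal illustration using the slide's example data; function and variable names are ours):

```python
import heapq

# Expansion lists for ~t, each a (similarity, sorted-by-score list) pair;
# data follows the slide's Incremental Merge example.
expansions = [
    (1.0, [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)]),
    (0.9, [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)]),
    (0.5, [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)]),
]

def incremental_merge(expansions):
    """Lazily yield (doc, sim * score) across all expansion lists,
    in descending order of the similarity-weighted score."""
    heap = []
    for lid, (sim, lst) in enumerate(expansions):
        if lst:                                  # seed with each list's head
            doc, s = lst[0]
            heapq.heappush(heap, (-sim * s, doc, lid, 0))
    while heap:
        neg, doc, lid, pos = heapq.heappop(heap)
        yield doc, -neg
        sim, lst = expansions[lid]
        if pos + 1 < len(lst):                   # advance only the popped list
            nd, ns = lst[pos + 1]
            heapq.heappush(heap, (-sim * ns, nd, lid, pos + 1))

for doc, score in incremental_merge(expansions):
    print(doc, round(score, 2))
```

Because each pull advances exactly one list by one entry, a consuming top-k operator touches only short list prefixes; duplicates per document are left to the consumer's best-match aggregation.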
Probabilistic Candidate Pruning
[Theobald, Schenkel & Weikum, VLDB '04]
- For each physically stored index list Li:
  - Treat each local score s(ti, d) ∈ [0, 1] as a random variable Si and consider P[Si > δ | Si ≤ high_i]
  - Approximate the local score distribution by an equi-width histogram with n buckets:
    freq_i[k] = (1 / |Li|) * #{ docs d ∈ Li with s(ti, d) ∈ (k/n, (k+1)/n] }
- For a virtual index list ~Li = Li1 ∪ ... ∪ Lim':
  - Consider the max-distribution (assuming feature independence):
    P[ max{Si1, ..., Sim'} > δ | Sil ≤ high_il ] = 1 − Π_{l=1..m'} P[ Sil ≤ δ | Sil ≤ high_il ]
  - Alternatively, construct a meta histogram for the active expansions:
    ~freq_i[k] = (1 / |Li1 ∪ ... ∪ Lim'|) * Σ_{l=1..m'} |Lil| * freq_l[k]
- For all d in the candidate queue:
  - Consider the convolution over the local score distributions to predict aggregated scores
  - Drop d from the candidate queue if P[ Σ_{i ∉ E(d)} Si > min-k − worstscore(d) | Si ≤ high_i ] falls below a pruning threshold ε
  - Return the current top-k list if the candidate queue is empty

Incremental Merge for Multidimensional Phrases
- Example query q = {undersea, "fiber optic cable"}
- A Nested Top-k operator iteratively prefetches & joins candidate items for each subquery condition via getNextItem()
- It propagates candidates in descending order of their bestscore(d) values to provide monotone upper score bounds
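The histogram machinery can be illustrated with a short sketch (our simplification of the VLDB '04 scheme: the bucket counts are invented, the partial bucket containing δ is ignored, and the conditioning on Si ≤ high_i is handled by a plain renormalization):

```python
# Histogram-based score prediction for probabilistic candidate pruning.
n = 10  # equi-width buckets of width 1/n over [0, 1]

def tail_prob(freq, high, delta):
    """Estimate P[S > delta | S <= high] from a histogram where
    freq[k] ~ fraction of docs with score in (k/n, (k+1)/n]."""
    hb = min(int(high * n), n - 1)      # bucket containing `high`
    db = int(delta * n)                 # bucket containing `delta`
    total = sum(freq[: hb + 1])         # probability mass with S <= high
    if total == 0 or delta >= high:
        return 0.0
    # crude bucket-level estimate: ignores the partial bucket holding delta
    return sum(freq[db + 1 : hb + 1]) / total

def convolve(freq_a, freq_b):
    """Histogram of S_a + S_b under independence (support [0, 2])."""
    out = [0.0] * (2 * n)
    for i, pa in enumerate(freq_a):
        for j, pb in enumerate(freq_b):
            out[i + j] += pa * pb
    return out

def predict(freq_sum, delta):
    """P[S_a + S_b > delta], assuming the input histograms were already
    truncated/renormalized to their [0, high_i] ranges."""
    return sum(freq_sum[int(delta * n) + 1 :])

# Toy pruning decision: uniform score distributions on both unseen lists;
# drop a candidate with min_k - worstscore(d) = 0.9 if this falls below eps.
uniform = [0.1] * n
p = predict(convolve(uniform, uniform), 0.9)
```

For a real candidate d, δ = min-k − worstscore(d) over the convolution of the unseen lists' histograms; the threshold ε then trades result precision against scan depth, exactly the knob varied in the experiments below.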
- It provides [worstscore(d), bestscore(d)] guarantees to the superordinate top-k operator
- Incremental Merge runs over phrase variants, e.g., sim("fiber optic cable", "fiber optic cable") = 1.0, sim("fiber optic cable", "fiber optics") = 0.8
- Single threshold condition for algorithm termination (candidate pruning at the top-level queue only)
- The top-level top-k operator performs phrase tests only for the most promising items, via random accesses to a term-to-position index (expensive predicates & minimal probes [Chang & Hwang, SIGMOD '02])

Experiments – Aquaint with Fixed Expansions
- Aquaint corpus of English news articles (528,155 docs)
- 50 "hard" queries from the TREC 2004 Robust track
- WordNet expansions using a simple form of WSD
- Okapi BM25 model for local scores, Dice coefficients as term similarities
- Fixed expansion technique (synonyms + first-order hyponyms)

Results (avg/max #expansion terms per query; #SA = sequential, #RA = random accesses):

  Title-only baseline:
    Join&Sort            2.5/4     #SA  2,305,637
    NRA-Baseline         2.5/4     #SA  1,439,815   #RA 0        9.4 s    432 KB     P@10 0.252  MAP 0.092  relPrec 1.000
  Static expansions:
    Join&Sort            35/118    #SA 20,582,764
    NRA+Phrases, ε=0.0   35/118    #SA 18,258,834   #RA 210,531  245.0 s  37,355 KB  P@10 0.286  MAP 0.105  relPrec 1.000
    NRA+Phrases, ε=0.1   35/118    #SA  3,622,686   #RA 49,783   79.6 s   5,895 KB   P@10 0.238  MAP 0.086  relPrec 0.541
  Dynamic expansions:
    Incr.Merge, ε=0.0    35/118    #SA  7,908,666   #RA 53,050   159.1 s  17,393 KB  P@10 0.310  MAP 0.118  relPrec 1.000
    Incr.Merge, ε=0.1    35/118    #SA  5,908,017   #RA 48,622   79.4 s   13,424 KB  P@10 0.298  MAP 0.110  relPrec 0.786

Experiments – Aquaint with Fixed Expansions, cont'd
- Probabilistic pruning performance: Incremental Merge vs. top-k with static expansions
- Epsilon controls the pruning aggressiveness, 0 ≤ ε ≤ 1
- [Charts: #sequential and #random accesses, and P@10 / MAP / relative precision, as functions of ε, for Incremental Merge vs. static expansion]
Conclusions & Ongoing Work
- Increased efficiency: Incremental Merge vs. Join-then-Sort and top-k with static expansions; very good precision/runtime ratio for probabilistic pruning
- Increased retrieval robustness: largely avoids topic drifts; fine-grained modeling of semantic similarities (Incremental Merge & Nested Top-k operators)
- Scalability (see paper): large expansions (m < 876 terms per query) on Aquaint; expansions for the Terabyte collection (~25,000,000 docs)
- Efficient support for XML-IR (INEX benchmark):
  - Inverted lists for combined tag-term pairs, e.g., sec=mining
  - Efficiently supports the child-or-descendant axis, e.g., //article//sec//=mining
  - Vague content & structure queries (VCAS), e.g., //article//~sec=~mining
- TopX engine, VLDB '05

Thank you!