Modeling and Solving Term Mismatch for Full-Text Retrieval
Dissertation Presentation
Le Zhao
Committee: Jamie Callan (Chair), Jaime Carbonell, Yiming Yang, Bruce Croft (UMass)
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
July 26, 2012

What is Full-Text Retrieval?
• The task: a user issues a query, the retrieval engine searches the document collection, and results are returned to the user.
• The Cranfield evaluation [Cleverdon 1960]
  – abstracts away the user,
  – allows objective & automatic evaluations.

Where are We (Going)?
• Current retrieval models
  – formal models from the 1970s, the best ones from the 1990s,
  – based on simple collection statistics (tf.idf), with no deep understanding of natural language texts.
• Perfect retrieval
  – Query: "information retrieval"; answer: "… text search …". Recognizing that the answer implies the query is textual entailment, a difficult natural language task.
  – Searcher frustration [Feild, Allan and Jones 2010]
  – Perfect retrieval is still far away. What has been holding us back?

Two Long Standing Problems in Retrieval
• Term mismatch
  – [Furnas, Landauer, Gomez and Dumais 1987]
  – No clear definition in retrieval.
• Relevance (query dependent term importance, P(t | R))
  – Traditionally idf (rareness); P(t | R) [Robertson and Spärck Jones 1976; Greiff 1998]
  – Few clues about estimation.
• This work
  – connects the two problems,
  – shows they can result in huge gains in retrieval,
  – and uses a predictive approach toward solving both problems.

What is Term Mismatch & Why Care?
• Job search – You look for information retrieval jobs on the market; they ask for text search skills. Mismatch costs you job opportunities (50%, even if you are careful).
• Legal discovery – You look for bribery or foul play in corporate documents; they say grease or pay off. Mismatch costs you cases.
• Patent/publication search – costs businesses.
• Medical record retrieval – costs lives.

Prior Approaches
• Document side:
  – Full text indexing (instead of indexing only key words)
  – Stemming (include morphological variants)
  – Document expansion (inlink anchor text, user tags)
• Query side:
  – Query expansion, reformulation
• Both:
  – Latent Semantic Indexing
  – Translation based models

Main Questions Answered
• Definition
• Significance (theory & practice)
• Mechanism (what causes the problem)
• Model and solution

Definition of Mismatch
[Venn diagram: within the collection, the set of all relevant documents Rq for query q ("all relevant jobs") overlaps the set of documents that contain t ("retrieval"); the relevant documents outside that overlap are the mismatched ones.]
• mismatch P(t̄ | Rq) = 1 – term recall P(t | Rq)
• Directly calculated given relevance judgments for q (a computational sketch follows):
  P(t̄ | Rq) = |{d : t ∉ d and d ∈ Rq}| / |Rq|
[CIKM 2010]
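To make the definition concrete, here is a minimal sketch (not from the dissertation; the data structures and toy documents are illustrative) of computing term recall and term mismatch directly from the relevance judgments of one query:

```python
from typing import List, Set

def term_recall(term: str, relevant_docs: List[Set[str]]) -> float:
    """P(t | Rq): fraction of the query's relevant documents that contain the term."""
    if not relevant_docs:
        return 0.0
    hits = sum(1 for doc in relevant_docs if term in doc)
    return hits / len(relevant_docs)

def term_mismatch(term: str, relevant_docs: List[Set[str]]) -> float:
    """P(t-bar | Rq) = 1 - term recall."""
    return 1.0 - term_recall(term, relevant_docs)

# Toy example: 4 relevant documents (as token sets) for one query
rel = [{"text", "search", "engine"},
       {"information", "retrieval", "model"},
       {"document", "retrieval"},
       {"web", "search"}]
print(term_recall("retrieval", rel))    # 0.5
print(term_mismatch("retrieval", rel))  # 0.5
```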
How Often do Terms Match?
• P(t | R) of one query term per topic (example TREC-3 topics):
  Query                                                                       | P(t | R)
  Oil Spills                                                                  | 0.9914
  Term limitations for US Congress members                                    | 0.9831
  Insurance Coverage which pays for Long Term Care                            | 0.6885
  School Choice Voucher System and its effects on the US educational program  | 0.2821
  Vitamin the cure or cause of human ailments                                 | 0.1071

Main Questions
• Definition
  – P(t | R) or P(t̄ | R): simple,
  – estimated from relevant documents,
  – used to analyze mismatch.
• Significance (theory & practice)
• Mechanism (what causes the problem)
• Model and solution

Term Mismatch & Probabilistic Retrieval Models
• Binary Independence Model [Robertson and Spärck Jones 1976]
  – The optimal ranking score for each document d combines, for every matching query term, a term recall component and an idf (rareness) component.
  – The same weight appears in the term weight of Okapi BM25.
  – Other advanced models behave similarly.
  – Used as effective features in Web search engines.
• The combined quantity is known as the "Relevance Weight" or "Term Relevance".
• P(t | R) is the only part of the model that is about the query and about relevance.

Main Questions
• Definition
• Significance
  – Theory: as important as idf, and the only part about relevance.
  – Practice?
• Mechanism (what causes the problem)
• Model and solution

Without Term Recall
• The emphasis problem for tf.idf retrieval models
  – They emphasize high idf (rare) terms in the query.
  – Example: "prognosis/viability of a political third party in U.S." (Topic 206)

Ground Truth (Term Recall)
Query: prognosis/viability of a political third party
  Term      | True P(t | R) | idf
  party     | 0.9796        | 2.513
  political | 0.7143        | 2.187
  third     | 0.5918        | 5.017
  viability | 0.0408        | 7.471
  prognosis | 0.0204        | 2.402
• Term recall puts the emphasis on party and political; idf puts it on the rare terms viability and third, which is the wrong emphasis.

Top Results (Language model)
Query: prognosis/viability of a political third party
  1. … discouraging prognosis for 1991 …
  2. … Politics … party … Robertson's viability as a candidate …
  3. … political parties …
  4. … there is no viable opposition …
  5. … A third of the votes …
  6. … politics … party … two thirds …
  7. … third ranking political movement …
  8. … political parties …
  9. … prognosis for the Sunday school …
  10. … third party provider …
• All are false positives: an emphasis/mismatch problem, not a precision problem.
• (Major Web search engines do better on this query, but still have false positives in the top 10; emphasis/mismatch is also a problem for large search engines.)

Without Term Recall
• The emphasis problem for tf.idf retrieval models
  – Emphasize high idf (rare) terms in the query: "prognosis/viability of a political third party in U.S." (Topic 206)
  – False positives throughout the rank list, especially detrimental at top ranks.
  – Ignoring term recall hurts precision at all recall levels.
• How significant is the emphasis problem?
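Writing out the Binary Independence Model term weight referred to above makes the emphasis problem concrete. This is the textbook Robertson/Spärck Jones form (not copied from the slides), with p_t = P(t | R) and q_t the probability that t occurs in a non-relevant document (q_t is usually approximated by df_t/N):

```latex
\mathrm{score}(d) \;=\; \sum_{t \in q \cap d} w_t,
\qquad
w_t \;=\; \log \frac{p_t \, (1 - q_t)}{(1 - p_t) \, q_t}
      \;=\; \underbrace{\log \frac{p_t}{1 - p_t}}_{\text{term recall part}}
      \;+\; \underbrace{\log \frac{1 - q_t}{q_t}}_{\approx\ \mathrm{idf}}
```

Fixing P(t | R) at a constant collapses the weight to pure idf, which is exactly the over-emphasis of rare terms that the failure analysis below quantifies.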
Failure Analysis of 44 Topics from TREC 6-8
[Pie chart: Emphasis 64%, Mismatch 27%, Precision 9% of failures.]
• Emphasis failures call for recall-based term weighting; mismatch failures call for mismatch-guided expansion; the basis for both is term mismatch prediction.
• RIA workshop 2003 (7 top research IR systems, >56 expert*weeks).
• Failure analyses of retrieval models & techniques are still standard today.

Main Questions
• Definition
• Significance
  – Theory: as important as idf, and the only part about relevance.
  – Practice: explains common failures and other behavior (personalization, word sense disambiguation, structured retrieval).
• Mechanism (what causes the problem)
• Model and solution

(Recap) Failure Analysis of 44 Topics from TREC 6-8
• Emphasis 64% → recall term weighting; Mismatch 27% → mismatch guided expansion; Precision 9%. Basis: term mismatch prediction. (RIA workshop 2003.)

True Term Recall Effectiveness
• +100% over BIM (in precision at all recall levels) [Robertson and Spärck Jones 1976]
• +30-80% over Language Model and BM25 (in MAP): this work.
• For a new query without relevance judgments,
  – term recall needs to be predicted,
  – and predictions don't need to be very accurate to show performance gains.

Main Questions
• Definition
• Significance
  – Theory: as important as idf, and the only part about relevance.
  – Practice: explains common failures and other behavior; +30 to 80% potential from term weighting.
• Mechanism (what causes the problem)
• Model and solution

How Often do Terms Match?
• Same term, different recall; P(t | R) varies from 0 to 1 and differs from idf. (Examples from TREC 3 topics.)
  Query                                                                       | P(t | R) | idf
  Oil Spills                                                                  | 0.9914   | 5.201
  Term limitations for US Congress members                                    | 0.9831   | 2.010
  Insurance Coverage which pays for Long Term Care                            | 0.6885   | 2.010
  School Choice Voucher System and its effects on the US educational program  | 0.2821   | 1.647
  Vitamin the cure or cause of human ailments                                 | 0.1071   | 6.405

Statistics
• Term recall across all query terms averages ~55-60%.
  [Histograms of term recall P(t | R): TREC 3 titles, 4.9 terms/query, average 55% term recall; TREC 9 descriptions, 6.3 terms/query, average 59% term recall.]
• Term recall on shorter queries averages ~70%.
  [Histograms: TREC 9 titles, 2.5 terms/query, average 70% term recall; TREC 13 titles, 3.1 terms/query, average 66% term recall.]
• Term recall is query dependent, but for many terms the variance is small.
  [Figure: term recall for repeating terms; 364 recurring words from TREC 3-7, 350 topics.]

P(t | R) vs. idf
[Scatter plot of P(t | R) against df/N, after Greiff (1998); TREC 4 description query terms.]
Prior Prediction Approaches
• Croft/Harper combination match (1979)
  – treats P(t | R) as a tuned constant, or estimates it from pseudo-relevance feedback;
  – when the constant is > 0.5, it rewards documents that match more query terms.
• Greiff's (1998) exploratory data analysis
  – used idf to predict the overall term weight; improved over basic BIM.
• Metzler's (2008) generalized idf
  – used idf to predict P(t | R); improved over basic BIM.
• A simple feature (idf), limited success.
  – Missing piece: P(t | R) = term recall = 1 – term mismatch.

What Factors can Cause Mismatch?
• Topic centrality (is the concept central to the topic?)
  – "Laser research related or potentially related to defense"
  – "Welfare laws propounded as reforms"
• Synonyms (how often do they replace the original term?)
  – "retrieval" == "search" == …
• Abstractness
  – "Laser research … defense", "Welfare laws"
  – "Prognosis/viability" (rare & abstract)

Main Questions
• Definition
• Significance
• Mechanism
  – Causes of mismatch: unnecessary concepts, or concepts replaced by synonyms or more specific terms.
• Model and solution

Designing Features to Model the Factors
• We need to identify synonyms/searchonyms of a query term in a query dependent way.
  – External resources? (WordNet, Wikipedia, or a query log)
    • Biased (coverage problems, collection independent),
    • static (not query dependent),
    • not easy to use; not used here.
  – Instead: term-term similarity in a concept space, via local LSI (Latent Semantic Indexing).
    • Run the query against the document collection, take the top (500) results, and build a concept space (150 dimensions) from them.

Synonyms from Local LSI
[Bar charts: similarity with the query term for the top similar terms, for three example query terms from "Term limitation for US Congress members" (P(t | Rq) = 0.9831), "Insurance Coverage which pays for Long Term Care" (0.6885), and "Vitamin the cure or cause of human ailments" (0.1071).]
• From these similarities we read off:
  (1) the magnitude of self similarity – term centrality;
  (2) the average similarity of the supporting terms – concept centrality;
  (3) how likely the synonyms are to replace term t in the collection – replaceability.

Features that Model the Factors (correlation with P(t | R))
• Term centrality – self-similarity (vector length of t) after dimension reduction: 0.3719
• Concept centrality – average similarity of the supporting terms (top synonyms): 0.3758
• Replaceability – how frequently synonyms appear in place of the original query term in collection documents: –0.1872
• Abstractness – users modify abstract terms with concrete terms (e.g. "effects on the US educational program", "prognosis of a political third party"): 0.1278
• idf: –0.1339

Prediction Model
• Regression modeling (sketches of the feature computation and regression step follow below).
  – Model M: <f1, f2, …, f5> → P(t | R)
  – Train on one set of queries (known relevance), test on another set of queries (unknown relevance).
  – RBF kernel Support Vector regression.
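A sketch of how the local-LSI similarity features described above might be computed (an illustration under assumptions, not the dissertation's implementation): build a small LSI space from the query's top-ranked documents, then read off term centrality, concept centrality, and the searchonym candidates. The replaceability feature (3) is omitted here because it needs collection-level counts.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def local_lsi_similarities(query_term, top_docs, n_dims=150, n_syn=10):
    """Local LSI over the query's top-ranked documents; returns
    (term centrality, concept centrality, top searchonyms) for one query term."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(top_docs)                       # documents x terms
    k = max(1, min(n_dims, X.shape[0] - 1, X.shape[1] - 1))
    svd = TruncatedSVD(n_components=k).fit(X)
    term_vecs = svd.components_.T * svd.singular_values_  # term vectors (V * Sigma)
    vocab = vec.vocabulary_
    if query_term not in vocab:
        return 0.0, 0.0, []                               # term absent from top docs
    t_vec = term_vecs[vocab[query_term]]
    term_centrality = float(np.linalg.norm(t_vec))        # (1) self-similarity magnitude
    norms = np.linalg.norm(term_vecs, axis=1) + 1e-12
    sims = term_vecs @ t_vec / (norms * (np.linalg.norm(t_vec) + 1e-12))
    inv_vocab = {i: w for w, i in vocab.items()}
    ranked = [inv_vocab[i] for i in np.argsort(-sims) if inv_vocab[i] != query_term]
    searchonyms = ranked[:n_syn]                          # candidates for feature (3)
    # (2) average similarity of the supporting terms, skipping the term itself
    concept_centrality = float(np.mean(np.sort(sims)[::-1][1:n_syn + 1]))
    return term_centrality, concept_centrality, searchonyms
```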
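And a sketch of the regression step: RBF-kernel support vector regression from the five features to term recall, trained on terms of queries with known relevance. The feature values and hyperparameters below are illustrative assumptions, not values from the dissertation.

```python
import numpy as np
from sklearn.svm import SVR

# One row per (training query, query term) pair:
# [term_centrality, concept_centrality, replaceability, abstractness, idf]
X_train = np.array([[0.61, 0.42, 0.10, 0.0, 2.5],
                    [0.15, 0.08, 0.55, 1.0, 7.5]])   # toy feature values
y_train = np.array([0.98, 0.04])                     # true P(t | R) from judgments

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)        # RBF kernel support vector regression
model.fit(X_train, y_train)

# Predict term recall for a term of a new (test) query, clipped to [0, 1]
X_test = np.array([[0.50, 0.30, 0.20, 0.0, 5.0]])
p_t_given_R = np.clip(model.predict(X_test), 0.0, 1.0)
print(p_t_given_R)
```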
A General View: Retrieval Modeling as Transfer Learning
• The traditional, restricted view sees a retrieval model as a document classifier for a given query.
• A more general view: a retrieval model really is
  – a meta-classifier, responsible for many queries,
  – mapping a query to a document classifier.
• Learning a retrieval model == transfer learning
  – using knowledge from related tasks (training queries) to classify documents for a new task (the test query);
  – our features and model facilitate the transfer;
  – the more general view allows more principled investigations and more advanced techniques.

Experiments
• Term recall prediction error
  – L1 loss (absolute prediction error).
• Term recall based term weighting retrieval
  – Mean Average Precision (overall retrieval success),
  – Precision at top 10 (precision at the top of the rank list).

Term Recall Prediction Example
Query: prognosis/viability of a political third party (trained on TREC 3)
  Term      | True P(t | R) | Predicted
  party     | 0.9796        | 0.6523
  political | 0.7143        | 0.6236
  third     | 0.5918        | 0.3080
  viability | 0.0408        | 0.2869
  prognosis | 0.0204        | 0.7585
• The predictions largely recover the right emphasis (party, political), though prognosis is overestimated.

Term Recall Prediction Error
• Average absolute error (L1 loss) on TREC 4; lower is better.
  [Bar chart comparing predictors: average (constant), idf only, all 5 features, tuned meta-parameters, TREC 3 recurring words.]

Main Questions
• Definition
• Significance
• Mechanism
• Model and solution
  – Term recall can be predicted; the framework lets us design and evaluate features.

Using P(t | R) in Retrieval Models
• In BM25 – through the Binary Independence Model relevance weight.
• In Language Modeling (LM) – through the Relevance Model [Lavrenko and Croft 2001].
• Only term weighting, no expansion.

Predicted Recall Weighting: 10-25% gain in MAP
[Bar chart: MAP of the baseline LM (description queries) vs. the recall-weighted (necessity) LM across datasets; train -> test pairs: 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14. "*": significantly better by sign & randomization tests.]

Predicted Recall Weighting: 10-20% gain in top precision
[Bar chart: Prec@10 of the baseline LM (description queries) vs. the recall-weighted LM across the same train -> test pairs. "*": Prec@10 significantly better; "!": Prec@20 significantly better.]

vs. Relevance Model
• Relevance Model [Lavrenko and Croft 2001]: based on term occurrence in top-ranked documents; its unsupervised weights Pm(t | R) approximate term recall P(t | R).
  [Scatter plot: predicted term recall (y) vs. normalized Relevance Model weights (x), TREC 3, 7 and 13.]
• Supervised term recall weighting is 5-10% better than the unsupervised Relevance Model weighting.
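As a concrete illustration of the recall-based term weighting evaluated above (an assumed Indri-style query mirroring the #weight syntax shown in the backup slides, not necessarily the exact run configuration used in the experiments), the ground-truth recall values of the example query could be applied directly as term weights:

```
#weight( 0.98 party  0.71 political  0.59 third  0.04 viability  0.02 prognosis )
```

The low-recall terms viability and prognosis still contribute, but no longer dominate the ranking the way pure idf weighting lets them.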
Main Questions
• Definition
• Significance
• Mechanism
• Model and solution
  – Term weighting solves the emphasis problem for long queries.
  – What about the mismatch problem?

(Recap) Failure Analysis of 44 Topics from TREC 6-8
• Emphasis 64% → recall term weighting; Mismatch 27% → mismatch guided expansion; Precision 9%. Basis: term mismatch prediction. (RIA workshop 2003, 7 top research IR systems, >56 expert*weeks.)

Recap: Term Mismatch
• Term mismatch ranges 30%-50% on average.
• Relevance matching can degrade quickly for multi-word queries.
• Solution: fix every query term. [SIGIR 2012]

Conjunctive Normal Form (CNF) Expansion
Example keyword query: placement of cigarette signs on television watched by children
Manual CNF:
  (placement OR place OR promotion OR logo OR sign OR signage OR merchandise)
  AND (cigarette OR cigar OR tobacco)
  AND (television OR TV OR cable OR network)
  AND (watch OR view)
  AND (children OR teen OR juvenile OR kid OR adolescent)
• Expressive & compact (1 CNF == 100s of keyword alternatives).
• Highly effective (this work: 50-300% over the base keyword query).
• Used by lawyers, librarians and other expert searchers.
• But tedious & difficult to create, and little researched.

Diagnostic Intervention (CNF expansion)
Query: placement of cigarette signs on television watched by children
• Diagnosis by low P(t | R): flags placement and children. Expansion CNF:
  (placement OR place OR promotion OR sign OR signage OR merchandise) AND cigar AND television AND watch AND (children OR teen OR juvenile OR kid OR adolescent)
• Diagnosis by high idf (rare terms): flags placement and television. Expansion CNF:
  (placement OR place OR promotion OR sign OR signage OR merchandise) AND cigar AND (television OR tv OR cable OR network) AND watch AND children
• Goal: the least amount of user effort for near-optimal performance, e.g. expanding 2 terms yields 90% of the total improvement.

Diagnostic Intervention (bag-of-words expansion)
Same query and diagnoses, but the expansion is formulated as a weighted bag-of-words query instead of a CNF:
• Low P(t | R) diagnosis:
  [ 0.9 (placement cigar television watch children) 0.1 (0.4 place 0.3 promotion 0.2 logo 0.1 sign 0.3 signage 0.3 merchandise 0.5 teen 0.4 juvenile 0.2 kid 0.1 adolescent) ]
• High idf diagnosis:
  [ 0.9 (placement cigar television watch children) 0.1 (0.4 place 0.3 promotion 0.2 logo 0.1 sign 0.3 signage 0.3 merchandise 0.5 tv 0.4 cable 0.2 network) ]
• Goal: the least amount of user effort for near-optimal performance, e.g. expanding 2 terms yields 90% of the total improvement (see the sketch below).
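A minimal sketch of the diagnose-then-expand step (illustrative only: the recall values and expansion lists below are made up, and the dissertation's actual term selection may differ): rank the query terms by predicted P(t | R), expand only the k worst into OR-groups, and leave the rest as single-term conjuncts.

```python
def diagnose_and_expand(query_terms, predicted_recall, expansions, k=2):
    """Build a CNF query that expands only the k terms with the lowest predicted P(t | R)."""
    worst = sorted(query_terms, key=lambda t: predicted_recall[t])[:k]
    clauses = []
    for t in query_terms:
        if t in worst and expansions.get(t):
            clauses.append("(" + " OR ".join([t] + expansions[t]) + ")")
        else:
            clauses.append(t)
    return " AND ".join(clauses)

# Toy example based on the tobacco query above; numbers are illustrative
terms = ["placement", "cigarette", "television", "watch", "children"]
recall = {"placement": 0.2, "cigarette": 0.9, "television": 0.8, "watch": 0.7, "children": 0.4}
expans = {"placement": ["place", "promotion", "sign", "signage"],
          "children": ["teen", "juvenile", "kid", "adolescent"]}
print(diagnose_and_expand(terms, recall, expans, k=2))
# (placement OR place OR promotion OR sign OR signage) AND cigarette AND television
#   AND watch AND (children OR teen OR juvenile OR kid OR adolescent)
```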
Diagnostic Intervention (the interactive setting we hoped to study)
[Flow diagram: the user issues a keyword query (e.g. child AND cigar); the diagnosis system (P(t | R) or idf) flags the problem query terms (child over cigar); the user supplies expansion terms (child → teen); query formulation (CNF or keyword) produces e.g. (child OR teen) AND cigar; the retrieval engine runs the query and the result is evaluated.]

We Ended up Using Simulation
[Same pipeline, but the user is replaced by offline expert input and online simulation: an expert user creates a full CNF offline, e.g. (child OR teen) AND (cigar OR tobacco); the problem-term diagnosis and the expansion terms are then simulated online from that full CNF; query formulation produces e.g. (child OR teen) AND cigar, which the retrieval engine runs for evaluation.]

Diagnostic Intervention Datasets
• Document sets
  – TREC 2007 Legal track: 7 million tobacco company documents.
  – TREC 4 Ad hoc track: 0.5 million newswire documents.
• CNF queries, 50 topics per dataset
  – TREC 2007 CNFs created by lawyers, TREC 4 CNFs by Univ. of Waterloo.
• Relevance judgments
  – TREC 2007 sparse, TREC 4 dense.
• Evaluation measures
  – TREC 2007 statAP, TREC 4 MAP.

Results – Diagnosis: P(t | R) vs. idf
[Line chart: percentage of the full-expansion MAP gain achieved vs. the number of query terms selected for expansion (0, 1, 2, 3, 4, All), for P(t | R) diagnosis and idf diagnosis on TREC 2007 and TREC 4; 0 terms = no expansion, All = full expansion. Diagnostic CNF expansion on TREC 4 and TREC 2007.]

Results – Form of Expansion: CNF vs. bag-of-words
[Line chart: retrieval performance (MAP) vs. the number of query terms selected, for CNF expansion and bag-of-words expansion on TREC 4 and TREC 2007; P(t | R) guided expansion.]
• 50% to 300% gain over the keyword baseline; a similar level of gain in top precision.

Main Questions
• Definition
• Significance
• Mechanism
• Model and solution
  – Term weighting for long queries.
  – Term mismatch prediction diagnoses problem terms, and produces simple & effective CNF queries.

Efficient P(t | R) Prediction
• 3-10X speedup (close to simple keyword retrieval), while maintaining 70-90% of the gain.
• Predict using P(t | R) values of the same term from similar, previously seen queries (a sketch follows below). [CIKM 2012]
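A minimal sketch of that idea, under the assumption that a term's recall varies little across queries: keep the term's historical P(t | R) values from previously seen queries and reuse their average, falling back to a constant prior when there is no history. The class and the prior value are illustrative.

```python
from collections import defaultdict

class HistoricalRecallPredictor:
    """Predict P(t | R) for a new query from the term's recall in previously seen queries."""
    def __init__(self, prior=0.6):
        self.history = defaultdict(list)   # term -> observed P(t | R) values
        self.prior = prior                 # roughly the average term recall reported earlier

    def observe(self, term, recall):
        self.history[term].append(recall)

    def predict(self, term):
        vals = self.history.get(term)
        return sum(vals) / len(vals) if vals else self.prior

p = HistoricalRecallPredictor()
p.observe("term", 0.98)      # "Term limitations for US Congress members"
p.observe("term", 0.69)      # "Insurance Coverage ... Long Term Care"
print(p.predict("term"))     # 0.835
print(p.predict("vitamin"))  # no history: falls back to 0.6
```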
Contributions
• Two long standing problems: mismatch & P(t | R).
• Definition and initial quantitative analysis of mismatch.
  – Do better/new features and prediction methods exist?
• Role of term mismatch in basic retrieval theory.
  – Principled ways to solve term mismatch.
  – What about advanced learning to rank, transfer learning?
• Ways to automatically predict term mismatch.
  – Initial modeling of the causes of mismatch, features.
  – Efficient prediction using historic information.
  – Are there better analyses or models of the causes?

Contributions (continued)
• Effectiveness of ad hoc retrieval.
  – Term weighting & diagnostic expansion.
  – How to do automatic CNF expansion?
  – Better formalisms: transfer learning, & more tasks?
• Diagnostic intervention.
  – Mismatch diagnosis guides targeted expansion.
  – How to diagnose specific types of mismatch problems, or different problems (mismatch/emphasis/precision)?
• Guide NLP, personalization, etc. to solve the real problem.
  – How to proactively identify search and other user needs?

Acknowledgements
• Committee: Jamie Callan, Jaime Carbonell, Yiming Yang, Bruce Croft.
• Ni Lao, Frank Lin, Siddharth Gopal, Jon Elsas, Jaime Arguello, Hui (Grace) Yang, Stephen Robertson, Matthew Lease, Nick Craswell, Yi Zhang (and her group), Jin Young Kim, Yangbo Zhu, Runting Shi, Yi Wu, Hui Tan, Yifan Yanggong, Mingyan Fan, Chengtao Wen – discussions, references & feedback.
• Reviewers of the papers & the NSF proposal.
• David Fisher, Mark Hoy, David Pane – maintaining the Lemur toolkit.
• Andrea Bastoni and Lorenzo Clemente – maintaining the LSI code for the Lemur toolkit.
• SVM-light, the Stanford parser.
• TREC: data.
• NSF Grant IIS-1018317.
• Xiangmin Jin, and my whole family and volleyball packs at CMU & the SF Bay.

END (backup slides follow)

Prior Definition of Mismatch
• Vocabulary mismatch (Furnas et al., 1987)
  – How likely two people are to disagree in vocabulary choice; domain experts disagree 80-90% of the time.
  – Led to Latent Semantic Indexing (Deerwester et al., 1988).
  – Query independent: it equals Avg_q P(t̄ | Rq), and can be reduced to our query dependent definition of term mismatch.

How Necessity Explains the Behavior of IR Techniques
• Why weight query bigrams 0.1 and query unigrams 0.9?
  – A bigram decreases term recall; the weight reflects recall.
• Why do bigrams not give stable improvements?
  – Term recall is more of a problem.
• Why does using document structure (fields, semantic annotation) not improve performance?
  – It improves precision; structural mismatch needs to be solved first.
• Word sense disambiguation
  – Enhances precision; instead, it should be used in mismatch modeling:
    • identify the query term sense, for searchonym identification or learning across queries;
    • disambiguate collection term senses for more accurate replaceability.
• Personalization
  – Biases results toward what a community/person likes to read (precision);
  – may work well in a mobile setting, with short queries.

Why Necessity? System Failure Analysis
• Reliable Information Access (RIA) workshop (2003)
  – Failure analysis of 7 top research IR systems:
    • 11 groups of researchers (both academia & industry),
    • 28 people directly involved in the analysis (senior & junior),
    • >56 human*weeks (analysis + running experiments),
    • 45 topics selected from the 150 TREC 6-8 topics (difficult topics).
  – Causes (necessity in various disguises):
    • Emphasize 1 aspect, missing another aspect (14+2 topics)
    • Emphasize 1 aspect, missing another term (7 topics)
    • Missing either 1 of 2 aspects, need both (5 topics)
    • Missing a difficult aspect that needs human help (7 topics)
    • Need to expand a general term, e.g. "Europe" (4 topics)
    • Precision problem, e.g. "euro", not "euro-…" (4 topics)

Local LSI Top Similar Terms
  Oil spills (spill) | Insurance coverage which pays for long term care (term) | Term limitations for US Congress members (term) | Vitamin the cure of or cause for human ailments (ail)
  oil 0.5828         | term 0.3310                                             | term 0.3339                                     | ail 0.4415
  oil 0.4210         | long 0.2173                                             | limit 0.1696                                    | health 0.0825
  tank 0.0986        | nurse 0.2114                                            | ballot 0.1115                                   | disease 0.0720
  crude 0.0972       | care 0.1694                                             | elect 0.1042                                    | basler 0.0718
  water 0.0830       | home 0.1268                                             | care 0.0997                                     | dr 0.0695

Error Plot of Necessity Predictions
[Figure: per-term necessity truth, predicted necessity, and the prediction trend (3rd order polynomial fit), plotted as probabilities.]

Necessity vs. idf (and Emphasis)
[Figure.]
True Necessity Weighting (MAP)
  TREC                   | 4         | 6         | 8           | 9       | 10      | 12      | 14
  Document collection    | disks 2,3 | disks 4,5 | d4,5 w/o CR | WT10g   | WT10g   | .GOV    | .GOV2
  Topic numbers          | 201-250   | 301-350   | 401-450     | 451-500 | 501-550 | TD1-50  | 751-800
  LM desc – Baseline     | 0.1789    | 0.1586    | 0.1923      | 0.2145  | 0.1627  | 0.0239  | 0.1789
  LM desc – Necessity    | 0.2703    | 0.2808    | 0.3057      | 0.2770  | 0.2216  | 0.0868  | 0.2674
  Improvement            | 51.09%    | 77.05%    | 58.97%      | 29.14%  | 36.20%  | 261.7%  | 49.47%
  p – randomization      | 0.0000    | 0.0000    | 0.0000      | 0.0000  | 0.0000  | 0.0000  | 0.0001
  p – sign test          | 0.0000    | 0.0000    | 0.0000      | 0.0005  | 0.0000  | 0.0000  | 0.0002
  Multinomial-abs        | 0.1988    | 0.2088    | 0.2345      | 0.2239  | 0.1653  | 0.0645  | 0.2150
  Multinomial RM         | 0.2613    | 0.2660    | 0.2969      | 0.2590  | 0.2259  | 0.1219  | 0.2260
  Okapi desc – Baseline  | 0.2055    | 0.1773    | 0.2183      | 0.1944  | 0.1591  | 0.0449  | 0.2058
  Okapi desc – Necessity | 0.2679    | 0.2786    | 0.2894      | 0.2387  | 0.2003  | 0.0776  | 0.2403
  LM title – Baseline    | N/A       | 0.2362    | 0.2518      | 0.1890  | 0.1577  | 0.0964  | 0.2511
  LM title – Necessity   | N/A       | 0.2514    | 0.2606      | 0.2058  | 0.2137  | 0.1042  | 0.2674

Predicted Necessity Weighting: 10-25% gain in MAP, 10-20% gain in top precision
  TREC train sets     | 3       | 3-5     | 3-7     | 7
  Test/x-validation   | 4       | 6       | 8       | 8
  LM desc – Baseline  | 0.1789  | 0.1586  | 0.1923  | 0.1923
  LM desc – Necessity | 0.2261  | 0.1959  | 0.2314  | 0.2333
  Improvement         | 26.38%  | 23.52%  | 20.33%  | 21.32%
  P@10 – Baseline     | 0.4160  | 0.2980  | 0.3860  | 0.3860
  P@10 – Necessity    | 0.4940  | 0.3420  | 0.4220  | 0.4380
  P@20 – Baseline     | 0.3450  | 0.2440  | 0.3310  | 0.3310
  P@20 – Necessity    | 0.4180  | 0.2900  | 0.3540  | 0.3610

Predicted Necessity Weighting (ctd.)
  TREC train sets     | 3-9     | 9       | 11      | 13
  Test/x-validation   | 10      | 10      | 12      | 14
  LM desc – Baseline  | 0.1627  | 0.1627  | 0.0239  | 0.1789
  LM desc – Necessity | 0.1813  | 0.1810  | 0.0597  | 0.2233
  Improvement         | 11.43%  | 11.25%  | 149.8%  | 24.82%
  P@10 – Baseline     | 0.3180  | 0.3180  | 0.0200  | 0.4720
  P@10 – Necessity    | 0.3280  | 0.3400  | 0.0467  | 0.5360
  P@20 – Baseline     | 0.2400  | 0.2400  | 0.0211  | 0.4460
  P@20 – Necessity    | 0.2790  | 0.2810  | 0.0411  | 0.5030

vs. Relevance Model
• Relevance Model query: #weight( 1-λ #combine( t1 t2 ) λ #weight( w1 t1 w2 t2 w3 t3 … ) ), with w1 ~ P(t1 | R), w2 ~ P(t2 | R), …
  [Scatter plot of x ~ y: Relevance Model weights vs. term recall.]
• Supervised > unsupervised (5-10%); weighting only ≈ expansion.
  Test/x-validation        | 4      | 6      | 8      | 8      | 10     | 10     | 12     | 14
  LM desc – Baseline       | 0.1789 | 0.1586 | 0.1923 | 0.1923 | 0.1627 | 0.1627 | 0.0239 | 0.1789
  Relevance Model desc     | 0.2423 | 0.1799 | 0.2352 | 0.2352 | 0.1888 | 0.1888 | 0.0221 | 0.1774
  RM reweight-Only desc    | 0.2215 | 0.1705 | 0.2435 | 0.2435 | 0.1700 | 0.1700 | 0.0692 | 0.1945
  RM reweight-Trained desc | 0.2330 | 0.1921 | 0.2542 | 0.2563 | 0.1809 | 0.1793 | 0.0534 | 0.2258

Feature Correlation (with P(t | R))
  f1 Term | f2 Con | f3 Repl | f4 DepLeaf | f5 idf  | RMw
  0.3719  | 0.3758 | –0.1872 | 0.1278     | –0.1339 | 0.6296
• Predicted necessity: 0.7989 (TREC 4 test set).

vs. Relevance Model (MAP)
[Bar chart: Baseline LM desc, Relevance Model desc, RM Reweight-Only desc, and RM Reweight-Trained desc across train -> test pairs 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14.]
• Weighting only ≈ expansion; supervised > unsupervised (5-10%); the Relevance Model is unstable.

Efficient Prediction of Term Recall
• Currently: slow, query dependent features that require an initial retrieval.
• Can prediction be more effective and more efficient?
  – Need to understand the causes of the query dependent variation.
  – Design a minimal set of efficient features that capture the query dependent variation.

Causes of Query Dependent Variation (1)
• Cause: different word senses, e.g. bear (verb) vs. bear (noun).

Causes of Query Dependent Variation (2)
• Cause: different word use, e.g. the term in a phrase vs. not: "Seasonal affective disorder syndrome (SADS)" vs. "Agoraphobia as a disorder".

Causes of Query Dependent Variation (3)
• Cause: different Boolean semantics of the queries, AND vs. OR, e.g. "Canada or Mexico" vs. "Canada".
Causes of Query Dependent Variation (4)
• Cause: different association levels with the topic.

Efficient P(t | R) Prediction (2)
• Causes of the P(t | R) variation of the same term in different queries:
  – different query semantics: "Canada or Mexico" vs. "Canada",
  – different word sense: bear (verb) vs. bear (noun),
  – different word use: "Seasonal affective disorder syndrome (SADS)" vs. "Agoraphobia as a disorder",
  – different association levels with the topic.
• Use historic occurrences to predict the current query:
  – 70-90% of the total gain,
  – 3-10X faster, close to simple keyword retrieval.

Efficient P(t | R) Prediction (2): Results
• Low variation of the same term across different queries.
• Using historic occurrences to predict the current query is 3-10X faster, close to the slower method, and real time.
  [Bar chart: MAP of Baseline LM desc, Necessity LM desc, and Efficient Prediction across train -> test pairs 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14; "*": significantly better.]

Using Document Structure
• Stylistic: XML.
• Syntactic/semantic: POS, semantic role labels.
• Current approaches are all precision oriented.
• Need to solve mismatch first?

Motivation
• Search is important: an information portal.
• Search is research worthy.
  – SIGIR, WWW, CIKM, ASIST, ECIR, AIRS, …
  – Search is difficult: retrieval modeling difficulty >= sentence paraphrasing.
  – Studied since the 1970s, but still not fully understood; basic problems like mismatch remain.
  – Must adapt to the changing requirements of the mobile, social and semantic Web.
• Modeling the user's needs.
  [Diagram: user activities produce a query, the retrieval model searches the collections, and results are returned.]

Online or Offline Study?
• Controlling confounding variables:
  – quality of the expansion terms,
  – the user's prior knowledge of the topic,
  – interaction form & effort.
• Online studies require enrolling many users and repeating experiments.
• Offline simulations avoid all of these and still support reasonable observations.

Simulation Assumptions
• Real full CNFs are used to simulate partial expansions.
• Three assumptions about the user expansion process:
  – Expansions of individual terms are independent of each other:
    • A1: always the same set of expansion terms for a given query term, no matter which subset of query terms gets expanded;
    • A2: the same sequence of expansion terms, no matter …
  – A3: the keyword query is re-constructed from the CNF query.
    • A procedure ensures the vocabulary is faithful to that of the original keyword description.
    • Highly effective CNF queries ensure a reasonable keyword baseline.

Take Home Message for Ordinary Search Users (people and software)
Be mean! Is the term Necessary for document relevance? If not, remove, replace or expand it.