Fuchun Peng, Microsoft Bing, 7/23/2010

A query is often treated as a bag of words, but when people formulate queries, they use "concepts" as building blocks. Consider a user looking for simmons college's sports psychology (course):

Q: simmons college sports psychology
A1: "simmons college", "sports psychology"
A2: "college sports"
• Can we automatically segment the query to recover the concepts?

Outline:
Summary of segmentation approaches
Use for improving search relevance
◦ Query rewriting
◦ Ranking features
Conclusions

Supervised learning (Bergsma & Wang, EMNLP-CoNLL 2007)
◦ A binary decision at each possible segmentation point, e.g. w1 [N] w2 [Y] w3 [N] w4 [Y] w5
◦ Features: POS tags, web counts, function words ("the", "and", …)
• Problems:
– Limited-range context
– Features specifically designed for noun phrases

Manual Data Preparation
◦ Linguistically driven: [san jose international airport]
◦ Relevance driven: [san jose] [international airport]

Mutual information (MI)
MI(w1, w2) = P(w1 w2) / (P(w1) P(w2))
Compute MI for each adjacent word pair (1,2), (2,3), (3,4), (4,5); insert a segment boundary wherever MI falls below a threshold, e.g. w1 w2 | w3 w4 w5, and update iteratively.
• Problems:
– Only captures short-range correlation (between adjacent words)
– What about "my heart will go on"?

Unigram concept model
Assume the query is generated by independent sampling from a probability distribution over concepts:
[simmons college] [sports psychology]: P = P(simmons college) × P(sports psychology) = 0.000016 × 0.000002 = 3.2 × 10^-11
[simmons] [college sports] [psychology]: P = P(simmons) × P(college sports) × P(psychology) = 0.000007 × 0.000006 × 0.000024 ≈ 1.0 × 10^-15
so the concept-level segmentation receives the higher probability.
• Enumerate all possible segmentations; rank them by the probability of being generated by the unigram model.
• How do we estimate the parameters P(w) of the unigram model?

Parameter estimation: we have ngram (n = 1..5) counts in a web corpus
◦ 464M documents; L = 33B tokens
◦ Approximate counts for longer ngrams are often computable, e.g.
#(harry potter and the goblet of fire) is in [5783, 6399], even though only ngrams up to n = 5 are stored, because
#(ABC) = #(AB) + #(BC) − #(AB or BC) ≥ #(AB) + #(BC) − #(B)
(counting occurrences at the position of the shared B, and using #(AB or BC) ≤ #(B)). The bounds for all longer ngrams are solved by dynamic programming.

Maximum likelihood estimate: PMLE(t) = #(t) / L
Problems:
◦ #(potter and the goblet of) = 6765
◦ So P(potter and the goblet of) > P(harry potter and the goblet of fire)? Wrong!
◦ What we need is not the probability of seeing t in text, but the probability of seeing t as a self-contained concept in text.

Query-relevant web corpus:

ngram | longest matching count | raw frequency
harry | 1657108 | 2003112
harry potter | 277736 | 346004
harry potter and | 10436 | 68268
harry potter and the | 51330 | 57832
harry potter and the goblet | 101 | 6502
harry potter and the goblet of | 618 | 6401
harry potter and the goblet of fire | 5783 | 5783
… | … | …
fire | 4200957 | 4478774

Choose the parameters to maximize the posterior probability given the query-relevant corpus (equivalently, to minimize the total description length):
t: a query substring
C(t): longest matching count of t
D = {(t, C(t))}: the query-relevant corpus
s(t): a segmentation of t
θ: unigram model parameters (ngram probabilities)
Posterior probability:
θ = argmax P(D|θ) P(θ) = argmax [log P(D|θ) + log P(θ)]
(the first term is the description length of the corpus, the second the description length of the parameters)
log P(D|θ) = ∑t C(t) log P(t|θ), where P(t|θ) = ∑s(t) P(s(t)|θ)

Experiments: three human-segmented datasets
◦ Training, validation, and test sets of 500 queries each
◦ Each segmented by three editors: A, B, C

Evaluation metrics:
◦ Boundary classification accuracy: each position between adjacent words is classified as boundary or not, e.g. w1 [N] w2 [Y] w3 [N] w4 [Y] w5
◦ Whole-query accuracy: the percentage of queries whose boundaries are all classified correctly
◦ Segment accuracy: the percentage of segments correctly recovered. Truth [abc] [de] [fg] vs. prediction [abc] [de fg]: one of the two predicted segments is correct (precision 1/2), and one of the three true segments is recovered (recall 1/3).

Use for Improving Search Relevance

Two applications: phrase proximity boosting (a ranking feature) and phrase-level query expansion (query rewriting).

Classify each segment into one of three categories:
◦ Strong concept: no word reordering, no word insertion/deletion. Treat the whole segment as a single unit in matching and ranking.
◦ Weak concept: word reordering and insertion/deletion are allowed. Boost documents matching the weak concepts.
◦ Not a concept: do nothing.

Ranking features:
◦ Concept-based BM25, weighted by the confidence of the concepts
◦ Concept-based min coverage, weighted by the confidence of the concepts

Query rewriting by phrase-level replacement:
◦ [San Francisco] -> [sf]
◦ [red eye flight] -> [late night flight]

Results: significant relevance boosting
◦ Affects 40% of query traffic
◦ Significant DCG gain (1.5% on affected queries)
◦ Significant online CTR gain (0.5% overall)

Conclusions
◦ How the segmentation data is prepared (linguistically vs. relevance driven) is important for query segmentation.
◦ Phrases are important for improving relevance.

References
Bergsma & Wang, EMNLP-CoNLL 2007
Risvik et al., WWW 2003
Hagen et al., SIGIR 2010
Tan & Peng, WWW 2008

Backup: estimating longest matching counts
Solution 1: offline, segment the whole web corpus, then collect counts of ngrams that occur as segments:
harry potter and the goblet of fire += 1
… | Harry Potter and the Goblet of Fire | is | the | fourth | novel | in | the | Harry Potter series | written by | J.K. Rowling | …
potter and the goblet of += 0 (it never occurs as a whole segment)
(cf. C. G. de Marcken, Unsupervised Language Acquisition, 1996; Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA 2001)
• Technically difficult

Solution 2: online computation: consider only the parts of the web corpus that overlap with the query, and count the longest matches.
Q = harry potter and the goblet of fire
… Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling …
harry potter and the goblet of fire += 1
the += 2
harry potter += 1

The longest matches depend on the query; for a shorter query the same sentence contributes differently:
Q = potter and the goblet
… Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling …
potter and the goblet += 1
the += 2
potter += 1

The longest matching counts can be computed directly from the raw ngram frequencies in O(|Q|²) time.
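The deck does not spell out the O(|Q|²) computation, so the sketch below is one plausible reading, not necessarily the talk's exact method: an occurrence of a query substring t is a longest match exactly when it cannot be extended within the query by one word to the left or right, which by inclusion-exclusion gives C(t) = #(t) − #(left·t) − #(t·right) + #(left·t·right). The raw frequencies in the demo are the ones from the Harry Potter table; substrings not shown on the slide default to 0 here, so only table entries are meaningful.

```python
def longest_matching_counts(query_words, raw_count):
    """For every contiguous substring t of the query, estimate C(t): how
    often t occurs in the corpus as a *longest* match, i.e. not extendable
    (within the query) by one more query word on either side.
    Inclusion-exclusion over the two one-word extensions:
      C(t) = #(t) - #(left+t) - #(t+right) + #(left+t+right).
    Results are keyed by substring text, so this sketch assumes each
    substring occurs at a single span of the query."""
    n = len(query_words)
    counts = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            t = " ".join(query_words[i:j])
            c = raw_count(t)
            if i > 0:            # subtract occurrences extendable to the left
                c -= raw_count(" ".join(query_words[i - 1:j]))
            if j < n:            # subtract occurrences extendable to the right
                c -= raw_count(" ".join(query_words[i:j + 1]))
            if i > 0 and j < n:  # add back occurrences extendable both ways
                c += raw_count(" ".join(query_words[i - 1:j + 1]))
            counts[t] = c
    return counts

# Raw web frequencies taken from the slide's table; anything else defaults to 0.
raw = {
    "harry": 2003112,
    "harry potter": 346004,
    "harry potter and": 68268,
    "harry potter and the": 57832,
    "harry potter and the goblet": 6502,
    "harry potter and the goblet of": 6401,
    "harry potter and the goblet of fire": 5783,
}
C = longest_matching_counts("harry potter and the goblet of fire".split(),
                            lambda t: raw.get(t, 0))
```

On the prefix ngrams this reproduces the longest matching counts in the table, e.g. C(harry) = 2003112 − 346004 = 1657108 and C(harry potter and) = 68268 − 57832 = 10436.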
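The count bound used earlier, #(ABC) ≥ #(AB) + #(BC) − #(B), "solved by DP", can be sketched as follows. This is an illustrative reconstruction under my own assumptions, not the talk's actual algorithm: given exact counts for ngrams up to some order, it derives lower and upper bounds for a longer ngram by combining overlapping sub-spans, iterating over spans in order of increasing length.

```python
def count_bounds(words, counts, max_n=5):
    """Bound the corpus count of the ngram `words` when its length exceeds
    max_n, given exact counts for all ngrams of length <= max_n.
    counts: dict mapping an ngram string to its corpus count.
    Returns (lower, upper) bounds for the full word sequence."""
    n = len(words)
    lb, ub = {}, {}
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            if length <= max_n:
                c = counts.get(" ".join(words[i:j]), 0)
                lb[i, j] = ub[i, j] = c  # exact count available
            else:
                # An ngram occurs no more often than any of its sub-spans.
                ub[i, j] = min(ub[i, j - 1], ub[i + 1, j])
                # Overlapping decomposition: span = X (i..m) union Y (k..j)
                # with nonempty overlap Z (k..m), k < m;
                # #(span) >= #(X) + #(Y) - #(Z) >= lb(X) + lb(Y) - ub(Z).
                best = 0
                for k in range(i + 1, j):
                    for m in range(k + 1, j):
                        best = max(best, lb[i, m] + lb[k, j] - ub[k, m])
                lb[i, j] = best
    return lb[0, n], ub[0, n]

# Toy corpus "a b a b a", with exact counts stored only for ngrams up to n = 2.
toy_counts = {"a": 3, "b": 2, "a b": 2, "b a": 2}
lo, hi = count_bounds("a b a".split(), toy_counts, max_n=2)
```

In the toy corpus the trigram "a b a" truly occurs twice, and the bounds pin it down exactly: lb = #(a b) + #(b a) − #(b) = 2 + 2 − 2 = 2 and ub = min(#(a b), #(b a)) = 2.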
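The core inference step, enumerating all segmentations of the query and ranking them by unigram-model probability, can be sketched with a small dynamic program (memoized recursion over start positions, equivalent to full enumeration). The probability table is toy data: the first five values are the ones on the simmons college slide, and the last two are made up to fill out the example; the function name and structure are mine, not Bing's implementation.

```python
from functools import lru_cache

# Toy unigram concept probabilities. The first five values come from the
# slide; "college" and "sports" alone are invented for illustration.
P = {
    "simmons college": 1.6e-5,
    "sports psychology": 2e-6,
    "simmons": 7e-6,
    "college sports": 6e-6,
    "psychology": 2.4e-5,
    "college": 1e-5,
    "sports": 1.2e-5,
}

def best_segmentation(words):
    """Return (probability, segments) of the most likely segmentation of
    `words` under the unigram concept model: the query is scored as the
    product of the probabilities of its segments."""
    n = len(words)

    @lru_cache(maxsize=None)
    def best(i):
        # Best (probability, segments) for the suffix starting at word i.
        if i == n:
            return (1.0, ())
        candidates = []
        for j in range(i + 1, n + 1):
            seg = " ".join(words[i:j])
            if seg in P:  # only known concepts can start a segment here
                p_rest, segs = best(j)
                candidates.append((P[seg] * p_rest, (seg,) + segs))
        return max(candidates) if candidates else (0.0, ())

    return best(0)

prob, segs = best_segmentation(("simmons", "college", "sports", "psychology"))
```

As on the slide, [simmons college] [sports psychology] wins with probability 0.000016 × 0.000002 = 3.2 × 10^-11, beating the word-by-word alternative (≈ 10^-15).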