Language Models for TR
(Lecture for CS410-CXZ Text Info Systems)
Feb. 25, 2011
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign

Text Generation with Unigram LM

(Unigram) Language Model θ, p(w|θ)  →  Sampling  →  Document

Topic 1: Text mining                       Topic 2: Health
  text        0.2                            food       0.25
  mining      0.1                            nutrition  0.1
  association 0.01                           healthy    0.05
  clustering  0.02                           diet       0.02
  …                                          …
  food        0.00001
  → Text mining paper                        → Food nutrition paper

Estimation of Unigram LM

Document: a “text mining paper” (total #words = 100)
Estimation: (Unigram) Language Model θ, p(w|θ) = ?

  word         count in doc   p(w|θ)
  text         10             10/100
  mining       5              5/100
  association  3              3/100
  database     3              3/100
  algorithm    2
  …
  query        1              1/100
  efficient    1

Language Models for Retrieval (Ponte & Croft 98)

Document  →  Document Language Model

  Text mining paper     →  text ?, mining ?, association ?, clustering ?, …, food ?, …
  Food nutrition paper  →  food ?, nutrition ?, healthy ?, diet ?, …

Query = “data mining algorithms”
Which model would most likely have generated this query?

Ranking Docs by Query Likelihood

  Doc    Doc LM    Query likelihood
  d1     θ_d1      p(q|θ_d1)
  d2     θ_d2      p(q|θ_d2)
  …
  dN     θ_dN      p(q|θ_dN)

The query q is scored against each document LM.

Retrieval as Language Model Estimation

• Document ranking based on query likelihood:

$$\log p(q \mid d) = \sum_{i} \log p(w_i \mid d), \quad \text{where } q = w_1 w_2 \ldots w_n$$

  Here p(w_i|d) is the document language model.
• Retrieval problem ≈ Estimation of p(w_i|d)
• Smoothing is an important issue, and distinguishes different approaches

How to Estimate p(w|d)?

• Simplest solution: Maximum Likelihood Estimator
  – P(w|d) = relative frequency of word w in d
  – What if a word doesn’t appear in the text? P(w|d) = 0
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words
• This is what “smoothing” is about …

Language Model Smoothing (Illustration)

[Figure: P(w) plotted against Word w, comparing the Max. Likelihood Estimate with the Smoothed LM]

$$p_{ML}(w) = \frac{\text{count of } w}{\text{count of all words}}$$

A General Smoothing Scheme

• All smoothing methods try to
  – discount the probability of words seen in a doc
  – re-allocate the extra probability so that unseen words will have a non-zero probability
• Most use a reference model (collection language model) to discriminate unseen words

$$p(w \mid d) = \begin{cases} p_{seen}(w \mid d) & \text{if } w \text{ is seen in } d \quad \text{(discounted ML estimate)} \\ \alpha_d \, p(w \mid C) & \text{otherwise} \quad \text{(collection language model)} \end{cases}$$

Smoothing & TF-IDF Weighting

• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain

$$\log p(q \mid d) = \sum_{w_i \in d,\; w_i \in q} \log \frac{p_{seen}(w_i \mid d)}{\alpha_d \, p(w_i \mid C)} + n \log \alpha_d + \sum_{i} \log p(w_i \mid C)$$

  – TF weighting: p_seen(w_i|d) in the numerator
  – IDF weighting: p(w_i|C) in the denominator
  – Doc length normalization: n log α_d (a long doc is expected to have a smaller α_d)
  – The last term, Σ_i log p(w_i|C), does not depend on d: ignore for ranking
• Smoothing with p(w|C) ≈ TF-IDF + length norm.

Derivation of the Query Likelihood Retrieval Formula

General smoothing scheme (p_Seen is the discounted ML estimate, p(w|C) the reference language model):

$$p(w \mid d) = \begin{cases} p_{Seen}(w \mid d) & \text{if } w \text{ is seen in } d \\ \alpha_d \, p(w \mid C) & \text{otherwise} \end{cases} \qquad \alpha_d = \frac{1 - \sum_{w \text{ is seen}} p_{Seen}(w \mid d)}{\sum_{w \text{ is unseen}} p(w \mid C)}$$

Retrieval formula using the general smoothing scheme:

$$\begin{aligned}
\log p(q \mid d) &= \sum_{w \in V,\; c(w,q) > 0} c(w,q) \log p(w \mid d) \\
&= \sum_{w \in V,\; c(w,d) > 0,\; c(w,q) > 0} c(w,q) \log p_{Seen}(w \mid d) + \sum_{w \in V,\; c(w,q) > 0,\; c(w,d) = 0} c(w,q) \log \alpha_d \, p(w \mid C) \\
&= \sum_{w \in V,\; c(w,d) > 0,\; c(w,q) > 0} c(w,q) \log p_{Seen}(w \mid d) + \sum_{w \in V,\; c(w,q) > 0} c(w,q) \log \alpha_d \, p(w \mid C) - \sum_{w \in V,\; c(w,q) > 0,\; c(w,d) > 0} c(w,q) \log \alpha_d \, p(w \mid C) \\
&= \sum_{w \in V,\; c(w,d) > 0,\; c(w,q) > 0} c(w,q) \log \frac{p_{Seen}(w \mid d)}{\alpha_d \, p(w \mid C)} + |q| \log \alpha_d + \sum_{w \in V,\; c(w,q) > 0} c(w,q) \log p(w \mid C)
\end{aligned}$$

Key rewriting step: the sum over query words unseen in d is rewritten as the sum over all query words minus the sum over query words seen in d (second line to third line). Here |q| = Σ_w c(w,q) is the query length, the n of the previous slide.
Similar rewritings are very common when using LMs for IR…

Three Smoothing Methods (Zhai & Lafferty 01)

• Simplified Jelinek-Mercer: Shrink uniformly toward p(w|C)

$$p_\lambda(w \mid d) = (1 - \lambda)\, p_{ml}(w \mid d) + \lambda\, p(w \mid C)$$

• Dirichlet prior (Bayesian): Assume pseudo counts μ p(w|C)

$$p_\mu(w \mid d) = \frac{c(w;d) + \mu\, p(w \mid C)}{|d| + \mu} = \frac{|d|}{|d| + \mu}\, p_{ml}(w \mid d) + \frac{\mu}{|d| + \mu}\, p(w \mid C)$$

• Absolute discounting: Subtract a constant δ

$$p_\delta(w \mid d) = \frac{\max(c(w;d) - \delta,\, 0) + \delta\, |d|_u\, p(w \mid C)}{|d|}$$

  where |d|_u is the number of unique words in d.

Comparison of Three Methods

  Query Type   JM      Dir     AD
  Title        0.228   0.256   0.237
  Long         0.278   0.276   0.260

[Figure: Relative performance of JM, Dir. and AD; precision (y-axis) by Method (x-axis), with one series for TitleQuery and one for LongQuery]

The Need for Query Modeling (Dual Role of Smoothing)

[Figure: one panel for keyword queries, one panel for verbose queries]

Why does query type affect smoothing sensitivity?

Another Reason for Smoothing

  Query =        “the     algorithms   for     data     mining”
  pDML(w|d1):     0.04    0.001        0.02    0.002    0.003
  pDML(w|d2):     0.02    0.001        0.01    0.003    0.004

For the content words:
  p(“algorithms”|d1) = p(“algorithms”|d2)
  p(“data”|d1) < p(“data”|d2)
  p(“mining”|d1) < p(“mining”|d2)

Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2)…
So we should make p(“the”) and p(“for”) less different for all docs, and smoothing helps achieve this goal…

After smoothing with p(w|d) = 0.1 pDML(w|d) + 0.9 p(w|REF), p(q|d1) < p(q|d2)!

  Query =              “the     algorithms   for     data       mining”
  P(w|REF)              0.2     0.00001      0.2     0.00001    0.00001
  Smoothed p(w|d1):     0.184   0.000109     0.182   0.000209   0.000309
  Smoothed p(w|d2):     0.182   0.000109     0.181   0.000309   0.000409

Two-stage Smoothing

  Stage-1                          Stage-2
  - Explain unseen words           - Explain noise in query
  - Dirichlet prior (Bayesian)     - 2-component mixture
  - parameter μ                    - parameter λ

$$P(w \mid d) = (1 - \lambda)\, \frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu} + \lambda\, p(w \mid U)$$

p(w|U) is the user background model.
λ and μ can be automatically set through statistical estimation.

What You Should Know

• The basic idea of ranking docs by query likelihood (“the language modeling approach”)
• How smoothing is connected with TF-IDF weighting and document length normalization
• The basic idea of two-stage smoothing
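
Worked sketch: query likelihood with smoothing

The code below is a minimal Python sketch of the three smoothing formulas (Jelinek-Mercer, Dirichlet prior, absolute discounting) and of query-likelihood scoring, applied to the “the algorithms for data mining” example. The function names and the default values of lam, mu and delta are illustrative assumptions, not part of the lecture; only the probabilities in the example tables and the 0.1/0.9 interpolation weights come from the slides.

```python
import math


def jelinek_mercer(count, doc_len, p_coll, lam=0.5):
    """(1 - lam) * p_ml(w|d) + lam * p(w|C); lam=0.5 is an assumed default."""
    p_ml = count / doc_len
    return (1.0 - lam) * p_ml + lam * p_coll


def dirichlet_prior(count, doc_len, p_coll, mu=2000.0):
    """(c(w;d) + mu * p(w|C)) / (|d| + mu); mu=2000 is an assumed default."""
    return (count + mu * p_coll) / (doc_len + mu)


def absolute_discount(count, doc_len, n_unique, p_coll, delta=0.7):
    """(max(c(w;d) - delta, 0) + delta * |d|_u * p(w|C)) / |d|."""
    return (max(count - delta, 0.0) + delta * n_unique * p_coll) / doc_len


def query_log_likelihood(query_words, word_prob):
    """log p(q|d) = sum over query tokens of log p(w|d)."""
    return sum(math.log(word_prob(w)) for w in query_words)


# Numbers from the "Another Reason for Smoothing" slide.
query = ["the", "algorithms", "for", "data", "mining"]
p_ref = {"the": 0.2, "algorithms": 0.00001, "for": 0.2,
         "data": 0.00001, "mining": 0.00001}
p_d1 = {"the": 0.04, "algorithms": 0.001, "for": 0.02,
        "data": 0.002, "mining": 0.003}
p_d2 = {"the": 0.02, "algorithms": 0.001, "for": 0.01,
        "data": 0.003, "mining": 0.004}


def interpolate(p_doc, lam=0.9):
    """Return w -> (1 - lam) * p_doc[w] + lam * p_ref[w] (0.1 / 0.9 on the slide)."""
    return lambda w: (1.0 - lam) * p_doc[w] + lam * p_ref[w]


# Without smoothing, the common words "the" and "for" let d1 win.
raw_d1 = query_log_likelihood(query, p_d1.get)
raw_d2 = query_log_likelihood(query, p_d2.get)

# With smoothing, the content words decide and d2 wins.
smooth_d1 = query_log_likelihood(query, interpolate(p_d1))
smooth_d2 = query_log_likelihood(query, interpolate(p_d2))

print("unsmoothed: d1 ranked above d2?", raw_d1 > raw_d2)
print("smoothed:   d2 ranked above d1?", smooth_d2 > smooth_d1)
```

The three smoothing functions take raw counts so that they can be applied to any document; the example reuses only the interpolation form because the slide gives probabilities rather than counts.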