9-lmir - University of Illinois at Urbana

Language Models for TR
(Lecture for CS410-CXZ Text Info Systems)
Feb. 25, 2011
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign

Text Generation with Unigram LM

A (unigram) language model $\theta$ defines a word distribution $p(w \mid \theta)$; a document is generated by sampling words from it.

Topic 1: Text mining (generates a text mining paper)
    text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …

Topic 2: Health (generates a food nutrition paper)
    food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …

Estimation of Unigram LM

Given a document, estimate the (unigram) language model $\theta$: $p(w \mid \theta) = ?$

Document: a "text mining paper" (total #words = 100)
    Counts: text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1
    Estimates: text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100

Language Models for Retrieval (Ponte & Croft 98)

Each document has its own language model, whose word probabilities are unknown:

Text mining paper: text ?, mining ?, association ?, clustering ?, …, food ?, …
Food nutrition paper: food ?, nutrition ?, healthy ?, diet ?, …

Query = "data mining algorithms"

Which model would most likely have generated this query?

Ranking Docs by Query Likelihood

The query q is scored against the language model of each document:

Doc      Doc LM              Query likelihood
$d_1$    $\theta_{d_1}$      $p(q \mid \theta_{d_1})$
$d_2$    $\theta_{d_2}$      $p(q \mid \theta_{d_2})$
…
$d_N$    $\theta_{d_N}$      $p(q \mid \theta_{d_N})$

Retrieval as Language Model Estimation

• Document ranking based on query likelihood

$$\log p(q \mid d) = \sum_{i} \log p(w_i \mid d), \quad \text{where } q = w_1 w_2 \ldots w_n$$

  Here $p(w_i \mid d)$ is the document language model.
• Retrieval problem ≈ Estimation of $p(w_i \mid d)$
• Smoothing is an important issue, and distinguishes different approaches

How to Estimate p(w|d)?

• Simplest solution: Maximum Likelihood Estimator
  - P(w|d) = relative frequency of word w in d
  - What if a word doesn't appear in the text? P(w|d) = 0
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words
• This is what "smoothing" is about …

Language Model Smoothing (Illustration)

[Figure: P(w) plotted against word w, comparing the maximum likelihood estimate with the smoothed LM]

$$p_{ML}(w) = \frac{\text{count of } w}{\text{count of all words}}$$

A General Smoothing Scheme

• All smoothing methods try to
  - discount the probability of words seen in a doc
  - re-allocate the extra probability so that unseen words will have a non-zero probability
• Most use a reference model (collection language model) to discriminate unseen words

$$p(w \mid d) = \begin{cases} p_{seen}(w \mid d) & \text{if } w \text{ is seen in } d \\ \alpha_d \, p(w \mid C) & \text{otherwise} \end{cases}$$

Here $p_{seen}(w \mid d)$ is the discounted ML estimate and $p(w \mid C)$ is the collection language model.

Smoothing & TF-IDF Weighting

• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain

$$\log p(q \mid d) = \sum_{w_i \in d,\; w_i \in q} \left[ \log \frac{p_{seen}(w_i \mid d)}{\alpha_d \, p(w_i \mid C)} \right] + n \log \alpha_d + \sum_{i} \log p(w_i \mid C)$$

  - $p_{seen}(w_i \mid d)$: TF weighting
  - $1 / p(w_i \mid C)$: IDF weighting
  - $n \log \alpha_d$: doc length normalization (a long doc is expected to have a smaller $\alpha_d$)
  - $\sum_i \log p(w_i \mid C)$: ignore for ranking
• Smoothing with p(w|C) ≈ TF-IDF + length norm.

Derivation of the Query Likelihood Retrieval Formula

The general smoothing scheme, with discounted ML estimate $p_{seen}(w \mid d)$ and reference language model $p(w \mid C)$:

$$p(w \mid d) = \begin{cases} p_{seen}(w \mid d) & \text{if } w \text{ is seen in } d \\ \alpha_d \, p(w \mid C) & \text{otherwise} \end{cases}
\qquad
\alpha_d = \frac{1 - \sum_{w \text{ is seen}} p_{seen}(w \mid d)}{\sum_{w \text{ is unseen}} p(w \mid C)}$$

Retrieval formula using the general smoothing scheme:

$$
\begin{aligned}
\log p(q \mid d) &= \sum_{w \in V,\, c(w,q) > 0} c(w,q) \log p(w \mid d) \\
&= \sum_{w \in V,\, c(w,d) > 0,\, c(w,q) > 0} c(w,q) \log p_{seen}(w \mid d) + \sum_{w \in V,\, c(w,q) > 0,\, c(w,d) = 0} c(w,q) \log \alpha_d \, p(w \mid C) \\
&= \sum_{w \in V,\, c(w,d) > 0,\, c(w,q) > 0} c(w,q) \log p_{seen}(w \mid d) + \sum_{w \in V,\, c(w,q) > 0} c(w,q) \log \alpha_d \, p(w \mid C) - \sum_{w \in V,\, c(w,q) > 0,\, c(w,d) > 0} c(w,q) \log \alpha_d \, p(w \mid C) \\
&= \sum_{w \in V,\, c(w,d) > 0,\, c(w,q) > 0} c(w,q) \log \frac{p_{seen}(w \mid d)}{\alpha_d \, p(w \mid C)} + |q| \log \alpha_d + \sum_{w \in V,\, c(w,q) > 0} c(w,q) \log p(w \mid C)
\end{aligned}
$$

The third line is the key rewriting step: the sum over query words unseen in d is written as the sum over all query words minus the sum over query words seen in d. Similar rewritings are very common when using LMs for IR…

Three Smoothing Methods (Zhai & Lafferty 01)

• Simplified Jelinek-Mercer: Shrink uniformly toward p(w|C)

$$p(w \mid d) = (1 - \lambda) \, p_{ML}(w \mid d) + \lambda \, p(w \mid C)$$

• Dirichlet prior (Bayesian): Assume pseudo counts $\mu \, p(w \mid C)$

$$p(w \mid d) = \frac{c(w; d) + \mu \, p(w \mid C)}{|d| + \mu} = \frac{|d|}{|d| + \mu} \, p_{ML}(w \mid d) + \frac{\mu}{|d| + \mu} \, p(w \mid C)$$

• Absolute discounting: Subtract a constant $\delta$

$$p(w \mid d) = \frac{\max(c(w; d) - \delta,\, 0) + \delta \, |d|_u \, p(w \mid C)}{|d|}$$

Comparison of Three Methods

Query Type   JM      Dir     AD
Title        0.228   0.256   0.237
Long         0.278   0.276   0.260

[Figure: bar chart "Relative performance of JM, Dir. and AD"; x-axis: method (JM, DIR, AD); y-axis: precision; series: TitleQuery, LongQuery]

The Need of Query-Modeling (Dual-Role of Smoothing)

[Figure: two panels, "Keyword queries" and "Verbose queries"]

Why does query type affect smoothing sensitivity?

Another Reason for Smoothing

Query = "the algorithms for data mining" (content words: algorithms, data, mining)

Query word      the     algorithms   for     data    mining
p_DML(w|d1)     0.04    0.001        0.02    0.002   0.003
p_DML(w|d2)     0.02    0.001        0.01    0.003   0.004

p("algorithms"|d1) = p("algorithms"|d2)
p("data"|d1) < p("data"|d2)
p("mining"|d1) < p("mining"|d2)

Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2)…

So we should make p("the") and p("for") less different for all docs, and smoothing helps achieve this goal…

After smoothing with $p(w \mid d) = 0.1 \, p_{DML}(w \mid d) + 0.9 \, p(w \mid REF)$, we get p(q|d1) < p(q|d2)!

Query word          the     algorithms   for     data       mining
P(w|REF)            0.2     0.00001      0.2     0.00001    0.00001
Smoothed p(w|d1)    0.184   0.000109     0.182   0.000209   0.000309
Smoothed p(w|d2)    0.182   0.000109     0.181   0.000309   0.000409

Two-stage Smoothing

Stage-1
  - Explain unseen words
  - Dirichlet prior (Bayesian), parameter $\mu$
Stage-2
  - Explain noise in query
  - 2-component mixture, parameter $\lambda$

$$P(w \mid d) = (1 - \lambda) \, \frac{c(w, d) + \mu \, p(w \mid C)}{|d| + \mu} + \lambda \, p(w \mid U)$$

Here $p(w \mid U)$ is the user background model. $\lambda$ and $\mu$ can be automatically set through statistical estimation.

What You Should Know

• The basic idea of ranking docs by query likelihood ("the language modeling approach")
• How smoothing is connected with TF-IDF weighting and document length normalization
• The basic idea of two-stage smoothing
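The query-likelihood ranking and the Jelinek-Mercer and Dirichlet prior smoothing formulas from the slides can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not code from the lecture: the toy documents, the function names, and the parameter defaults (`lam=0.5`, `mu=10.0`) are hypothetical, and every query word is assumed to occur somewhere in the collection so that p(w|C) > 0.

```python
import math
from collections import Counter


def collection_model(docs):
    """Collection language model p(w|C): relative frequency over all documents."""
    counts = Counter()
    for words in docs.values():
        counts.update(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def p_jelinek_mercer(w, counts, doc_len, p_c, lam=0.5):
    """Jelinek-Mercer: (1 - lam) * p_ML(w|d) + lam * p(w|C)."""
    return (1 - lam) * counts[w] / doc_len + lam * p_c[w]


def p_dirichlet(w, counts, doc_len, p_c, mu=10.0):
    """Dirichlet prior: (c(w;d) + mu * p(w|C)) / (|d| + mu)."""
    return (counts[w] + mu * p_c[w]) / (doc_len + mu)


def log_query_likelihood(query, words, p_c, smooth):
    """log p(q|d) = sum over query words w_i of log p(w_i|d)."""
    counts = Counter(words)  # a Counter returns 0 for words unseen in d
    return sum(math.log(smooth(w, counts, len(words), p_c)) for w in query)


def rank(query, docs, smooth=p_dirichlet):
    """Return document names sorted by decreasing query log-likelihood."""
    p_c = collection_model(docs)
    scores = {
        name: log_query_likelihood(query, words, p_c, smooth)
        for name, words in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical two-document collection, loosely modeled on the slides' example.
docs = {
    "text_mining_paper": "text mining algorithms for text data mining".split(),
    "food_nutrition_paper": "food nutrition and healthy diet for data".split(),
}
query = "data mining algorithms".split()
ranking = rank(query, docs)
print(ranking)
```

Passing `smooth=p_jelinek_mercer` to `rank` switches the smoothing method without touching the scoring loop, which mirrors the point that the smoothing choice is what distinguishes the approaches.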

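The numbers on the "Another Reason for Smoothing" slide can be checked with a few lines of Python. The sketch below uses only the probabilities given on that slide; the variable and function names are illustrative assumptions.

```python
import math

query = ["the", "algorithms", "for", "data", "mining"]

# Unsmoothed document models p_DML(w|d) and reference model p(w|REF), from the slide.
p_d1 = {"the": 0.04, "algorithms": 0.001, "for": 0.02, "data": 0.002, "mining": 0.003}
p_d2 = {"the": 0.02, "algorithms": 0.001, "for": 0.01, "data": 0.003, "mining": 0.004}
p_ref = {"the": 0.2, "algorithms": 0.00001, "for": 0.2, "data": 0.00001, "mining": 0.00001}


def likelihood(model):
    """p(q|d): product of the word probabilities over the query."""
    return math.prod(model[w] for w in query)


def smooth(p_doc, doc_weight=0.1):
    """p(w|d) = 0.1 * p_DML(w|d) + 0.9 * p(w|REF), as on the slide."""
    return {w: doc_weight * p_doc[w] + (1 - doc_weight) * p_ref[w] for w in query}


unsmoothed_d1, unsmoothed_d2 = likelihood(p_d1), likelihood(p_d2)
smoothed_d1, smoothed_d2 = likelihood(smooth(p_d1)), likelihood(smooth(p_d2))

# Before smoothing d1 wins because of "the" and "for"; after smoothing d2 wins.
print(unsmoothed_d1 > unsmoothed_d2, smoothed_d1 < smoothed_d2)
```

Smoothing pulls p("the") and p("for") toward the shared reference values, so the content words "data" and "mining" decide the ranking.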