Language Models for TR
(Lecture for CS410-CXZ Text Info Systems)
Feb. 25, 2011
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Text Generation with Unigram LM

A (unigram) language model p(w|θ) assigns a probability to every word; a document is generated by sampling words from it.

Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001
→ sampling yields a "text mining paper"

Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
→ sampling yields a "food nutrition paper"
Estimation of Unigram LM

Given a document, a "text mining paper" with total #words = 100 and counts

text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …

estimate the (unigram) language model p(w|θ) = ? Using relative frequencies, we get

text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …
Language Models for Retrieval
(Ponte & Croft 98)

Each document is represented by its own language model:

"Text mining paper" → text ?, mining ?, association ?, clustering ?, …, food ?, …
"Food nutrition paper" → food ?, nutrition ?, healthy ?, diet ?, …

Query = "data mining algorithms"
Which model would most likely have generated this query?
Ranking Docs by Query Likelihood

Estimate a language model θ_di for each document di, then rank the documents by the likelihood they assign to the query q:

d1 → θ_d1 → p(q|θ_d1)
d2 → θ_d2 → p(q|θ_d2)
…
dN → θ_dN → p(q|θ_dN)
Retrieval as Language Model Estimation

• Document ranking based on query likelihood (a minimal scoring sketch follows below):
  $\log p(q|d) = \sum_i \log p(w_i|d)$, where $q = w_1 w_2 \ldots w_n$
• Retrieval problem → estimation of the document language model p(wi|d)
• Smoothing is an important issue, and distinguishes different approaches
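A minimal sketch of this scoring rule, assuming the document model is given as a word→probability dict that is already smoothed so no query word has probability zero (names are illustrative, not from the slide):

```python
import math

def query_likelihood(query_words, doc_lm):
    # log p(q|d) = sum_i log p(w_i|d)
    return sum(math.log(doc_lm[w]) for w in query_words)
```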
How to Estimate p(w|d)?

• Simplest solution: the Maximum Likelihood Estimator
  – p(w|d) = relative frequency of word w in d
  – What if a word doesn't appear in the text? Then p(w|d) = 0 (see the sketch below)
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words
• This is what "smoothing" is about…
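A minimal sketch of the ML estimator, showing the zero-probability problem (the function name and toy document are made up for illustration):

```python
from collections import Counter

def mle_unigram(doc_words):
    # p(w|d) = c(w,d) / |d|  (relative frequency)
    counts = Counter(doc_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

lm = mle_unigram("text mining paper on text clustering".split())
# lm["text"] == 2/6, lm["mining"] == 1/6, ...
# lm.get("query", 0.0) == 0.0 -> any query containing "query" gets
# likelihood 0 (log-likelihood -inf), no matter how good the doc is
```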
Language Model Smoothing
(Illustration)

[Figure: P(w) plotted over words w, comparing the maximum likelihood estimate with the smoothed LM: smoothing lowers the probabilities of observed words and lifts zero-count words.]

$p_{ML}(w) = \frac{\text{count of } w}{\text{count of all words}}$
A General Smoothing Scheme

• All smoothing methods try to
  – discount the probability of words seen in a doc
  – re-allocate the extra probability so that unseen words will have a non-zero probability
• Most use a reference model (collection language model) to discriminate unseen words (a sketch follows below):

$p(w|d) = \begin{cases} p_{seen}(w|d) & \text{if } w \text{ is seen in } d \\ \alpha_d \, p(w|C) & \text{otherwise} \end{cases}$

where $p_{seen}(w|d)$ is the discounted ML estimate and $p(w|C)$ is the collection language model.
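A sketch of this scheme, assuming `p_seen` is a dict of discounted ML estimates for the words seen in d and `collection_lm` covers the whole vocabulary (names are illustrative):

```python
def smoothed_prob(w, p_seen, alpha_d, collection_lm):
    # p(w|d) = p_seen(w|d)        if w is seen in d
    #        = alpha_d * p(w|C)   otherwise
    if w in p_seen:
        return p_seen[w]
    return alpha_d * collection_lm[w]
```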
Smoothing & TF-IDF Weighting

• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain:

$\log p(q|d) = \sum_{w_i \in d,\, w_i \in q} \log \frac{p_{seen}(w_i|d)}{\alpha_d \, p(w_i|C)} + n \log \alpha_d + \sum_i \log p(w_i|C)$

The first term acts like TF weighting ($p_{seen}$ in the numerator) combined with IDF weighting ($p(w_i|C)$ in the denominator); $n \log \alpha_d$ provides doc length normalization (a long doc is expected to have a smaller $\alpha_d$); the last term is the same for all docs and can be ignored for ranking.

• Smoothing with p(w|C) ⇒ TF-IDF weighting + length normalization (a sketch follows below)
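A sketch of the rewritten score, under the same assumptions as above; the document-independent sum of log p(w|C) terms is dropped for ranking:

```python
import math

def rank_score(query_words, p_seen, alpha_d, collection_lm):
    # sum over query words seen in d of
    # log[p_seen(w|d) / (alpha_d * p(w|C))]  -> TF-IDF-like weighting
    matched = sum(
        math.log(p_seen[w] / (alpha_d * collection_lm[w]))
        for w in query_words if w in p_seen
    )
    # n * log(alpha_d) -> document length normalization
    return matched + len(query_words) * math.log(alpha_d)
```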
Derivation of the Query Likelihood Retrieval Formula

Start from the general smoothing scheme, where $p_{seen}(w|d)$ is the discounted ML estimate and $p(w|C)$ is the reference language model:

$p(w|d) = \begin{cases} p_{seen}(w|d) & \text{if } w \text{ is seen in } d \\ \alpha_d \, p(w|C) & \text{otherwise} \end{cases}
\qquad
\alpha_d = \frac{1 - \sum_{w \text{ seen in } d} p_{seen}(w|d)}{\sum_{w \text{ unseen in } d} p(w|C)}$

$\log p(q|d) = \sum_{w \in V,\, c(w,q) > 0} c(w,q) \log p(w|d)$

$= \sum_{\substack{w:\, c(w,d) > 0 \\ c(w,q) > 0}} c(w,q) \log p_{seen}(w|d) \;+ \sum_{\substack{w:\, c(w,q) > 0 \\ c(w,d) = 0}} c(w,q) \log \alpha_d \, p(w|C)$

$= \sum_{\substack{w:\, c(w,d) > 0 \\ c(w,q) > 0}} c(w,q) \log \frac{p_{seen}(w|d)}{\alpha_d \, p(w|C)} \;+ \sum_{w:\, c(w,q) > 0} c(w,q) \log \alpha_d \, p(w|C)$

(key rewriting step: add and subtract the $\log \alpha_d \, p(w|C)$ terms for the query words that do appear in $d$, so that the second sum runs over all query words)

$= \sum_{\substack{w:\, c(w,d) > 0 \\ c(w,q) > 0}} c(w,q) \log \frac{p_{seen}(w|d)}{\alpha_d \, p(w|C)} \;+\; |q| \log \alpha_d \;+ \sum_{w:\, c(w,q) > 0} c(w,q) \log p(w|C)$

Similar rewritings are very common when using LMs for IR…
Three Smoothing Methods
(Zhai & Lafferty 01)

• Simplified Jelinek-Mercer: shrink uniformly toward p(w|C)

  $p(w|d) = (1-\lambda)\, p_{ml}(w|d) + \lambda\, p(w|C)$

• Dirichlet prior (Bayesian): assume $\mu$ pseudo counts distributed as p(w|C)

  $p(w|d) = \frac{c(w,d) + \mu\, p(w|C)}{|d| + \mu} = \frac{|d|}{|d|+\mu}\, p_{ml}(w|d) + \frac{\mu}{|d|+\mu}\, p(w|C)$

• Absolute discounting: subtract a constant $\delta$

  $p(w|d) = \frac{\max(c(w,d)-\delta,\, 0) + \delta\, |d|_u\, p(w|C)}{|d|}$

A sketch of all three follows below.
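A minimal sketch of the three estimates; the default parameter values are illustrative, not from the slide:

```python
def jelinek_mercer(c_wd, doc_len, p_wc, lam=0.1):
    # p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|C)
    return (1 - lam) * c_wd / doc_len + lam * p_wc

def dirichlet(c_wd, doc_len, p_wc, mu=2000):
    # p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)
    return (c_wd + mu * p_wc) / (doc_len + mu)

def absolute_discounting(c_wd, doc_len, n_unique, p_wc, delta=0.7):
    # p(w|d) = (max(c(w,d) - delta, 0) + delta * |d|_u * p(w|C)) / |d|
    # n_unique = |d|_u, the number of distinct words in d
    return (max(c_wd - delta, 0) + delta * n_unique * p_wc) / doc_len
```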
Comparison of Three Methods

Relative performance (precision) of JM, Dirichlet, and absolute discounting:

Method                  Title query   Long query
Jelinek-Mercer (JM)     0.228         0.278
Dirichlet prior (Dir)   0.256         0.276
Absolute disc. (AD)     0.237         0.260
The Need of Query-Modeling
(Dual-Role of Smoothing)

[Figure: retrieval performance as a function of the smoothing parameter, plotted separately for keyword queries and verbose queries.]

Why does query type affect smoothing sensitivity?
Another Reason for Smoothing
Query = "the algorithms for data mining"

Unsmoothed (discounted) ML estimates ("algorithms", "data", "mining" are the content words):

Word          pDML(w|d1)   pDML(w|d2)
the           0.04         0.02
algorithms    0.001        0.001
for           0.02         0.01
data          0.002        0.003
mining        0.003        0.004

p("algorithms"|d1) = p("algorithms"|d2), p("data"|d1) < p("data"|d2), and p("mining"|d1) < p("mining"|d2). Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2), because d1 gives higher probability to the common words "the" and "for"…

So we should make p("the") and p("for") less different for all docs, and smoothing helps achieve this goal…

After smoothing with $p(w|d) = 0.1\, p_{DML}(w|d) + 0.9\, p(w|REF)$, we get $p(q|d1) < p(q|d2)$!

Word          p(w|REF)    Smoothed p(w|d1)   Smoothed p(w|d2)
the           0.2         0.184              0.182
algorithms    0.00001     0.000109           0.000109
for           0.2         0.182              0.181
data          0.00001     0.000209           0.000309
mining        0.00001     0.000309           0.000409
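The numbers in the tables can be checked directly; a small sketch reproducing the score flip:

```python
from math import prod

q = ["the", "algorithms", "for", "data", "mining"]
d1 = {"the": 0.04, "algorithms": 0.001, "for": 0.02,
      "data": 0.002, "mining": 0.003}
d2 = {"the": 0.02, "algorithms": 0.001, "for": 0.01,
      "data": 0.003, "mining": 0.004}
ref = {"the": 0.2, "algorithms": 1e-5, "for": 0.2,
       "data": 1e-5, "mining": 1e-5}

def unsmoothed(d):                    # p(q|d) with raw estimates
    return prod(d[w] for w in q)

def smoothed(d):                      # p(w|d) = 0.1*pDML + 0.9*p(w|REF)
    return prod(0.1 * d[w] + 0.9 * ref[w] for w in q)

assert unsmoothed(d1) > unsmoothed(d2)   # before smoothing: d1 wins
assert smoothed(d1) < smoothed(d2)       # after smoothing: d2 wins
```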
Two-stage Smoothing

$p(w|d) = (1-\lambda)\,\frac{c(w,d) + \mu\, p(w|C)}{|d| + \mu} + \lambda\, p(w|U)$

Stage 1 (Dirichlet prior, Bayesian; parameter $\mu$): smooth with the collection model p(w|C) to explain unseen words in the document.
Stage 2 (two-component mixture; parameter $\lambda$): mix with a user background model p(w|U) to explain noise in the query.

$\mu$ and $\lambda$ can be automatically set through statistical estimation.
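A sketch combining the two stages; `p_wu` stands for the user background model p(w|U), and the default parameter values are illustrative:

```python
def two_stage(c_wd, doc_len, p_wc, p_wu, mu=2000, lam=0.1):
    # Stage 1 (Dirichlet prior, mu): explain unseen words in the document
    stage1 = (c_wd + mu * p_wc) / (doc_len + mu)
    # Stage 2 (mixture, lam): explain noise/common words in the query
    return (1 - lam) * stage1 + lam * p_wu
```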
What You Should Know

• The basic idea of ranking docs by query likelihood ("the language modeling approach")
• How smoothing is connected with TF-IDF weighting and document length normalization
• The basic idea of two-stage smoothing