Statistical Translation and
Web Search Ranking
Jianfeng Gao
Natural language processing, MSR
July 22, 2011
Who should be here?
• Interested in statistical machine translation
and Web search ranking
• Interested in modeling technologies
• Looking for topics for your master's/PhD thesis
– A difficult topic: very hard to beat a simple
baseline
– An easy topic: others cannot beat it either
Outline
• Probability
• Statistical Machine Translation (SMT)
• SMT for Web search ranking
Probability (1/2)
• Probability space: x ∈ X
– P(x) ∈ [0, 1]
– Σ_{x∈X} P(x) = 1
– Cannot say P(x) > P(y) if x ∈ X but y ∉ X
• Joint probability: P(x, y)
– Probability that x and y are both true
• Conditional probability: P(y|x)
– Probability that y is true when we already know x is true
• Independence: P(x, y) = P(x)P(y)
– x and y are independent
Probability (2/2)
• H: assumptions on which the probabilities are based
• Product rule – from the def of conditional probability
– P(x, y|H) = P(x|y, H) P(y|H) = P(y|x, H) P(x|H)
• Sum rule – a rewrite of the marginal probability def
– P(x|H) = Σ_y P(x, y|H) = Σ_y P(x|y, H) P(y|H)
• Bayes rule – from the product rule
– P(y|x, H) = P(x|y, H) P(y|H) / P(x|H)
An example:
Statistical Language Modeling
Statistical Language Modeling (SLM)
• Model form
– capture language structure via a probabilistic
model
– Pr(W|G) = P(W|G, θ), where θ denotes the model parameters
• Model parameters
– estimation of free parameters using training data
– θ* = argmax_θ P(W|G, θ)
Model Form
• How to incorporate language structure into a
probabilistic model
• Task: next word prediction
– Fill in the blank: “The dog of our neighbor ___”
• Starting point: word n-gram model
– Very simple, yet surprisingly effective
– Words are generated from left-to-right
– Assumes no other structure than words
themselves
Word N-gram Model
• Word based model
– Using chain rule on its history (= preceding words)

P(the dog of our neighbor barks)
= P(the | <s>)
× P(dog | <s>, the)
× P(of | <s>, the, dog)
…
× P(barks | <s>, the, dog, of, our, neighbor)
× P(</s> | <s>, the, dog, of, our, neighbor, barks)

P(w1, w2 … wn)
= P(w1 | <s>)
× P(w2 | <s> w1)
× P(w3 | <s> w1 w2)
…
× P(wn | <s> w1 w2 … wn−1)
× P(</s> | <s> w1 w2 … wn)
Word N-gram Model
• How do we get probability estimates?
– Get text and count! P(w2|w1) = Count(w1, w2) / Count(w1)
• Problem of using the whole history
– Rare events: unreliable probability estimates
– Assuming a vocabulary of 20,000 words:

model                      # parameters
unigram   P(w1)            20,000
bigram    P(w2|w1)         400 million
trigram   P(w3|w1 w2)      8 × 10^12
fourgram  P(w4|w1 w2 w3)   1.6 × 10^17
From Manning and Schütze 1999: 194
Word N-gram Model
• Markov independence assumption
– A word depends only on N-1 preceding words
– N=3 → word trigram model
• Reduce the number of parameters in the model
– By forming equivalence classes
• Word trigram model
P(wi | <s> w1 w2 … wi−2 wi−1) = P(wi | wi−2 wi−1)

P(w1 w2 … wn)
= P(w1 | <s>)
× P(w2 | <s> w1)
× P(w3 | w1 w2)
…
× P(wn | wn−2 wn−1)
× P(</s> | wn−1 wn)
Model Parameters
• Bayesian estimation paradigm
• Maximum likelihood estimation (MLE)
• Smoothing in N-gram language models
Bayesian Paradigm
• 𝑃(π‘šπ‘œπ‘‘π‘’π‘™|π‘‘π‘Žπ‘‘π‘Ž) =
–
–
–
–
𝑃
π‘‘π‘Žπ‘‘π‘Ž π‘šπ‘œπ‘‘π‘’π‘™
𝑃 π‘šπ‘œπ‘‘π‘’π‘™
𝑃 π‘‘π‘Žπ‘‘π‘Ž
𝑃(π‘šπ‘œπ‘‘π‘’π‘™|π‘‘π‘Žπ‘‘π‘Ž) – Posterior probability
𝑃(π‘‘π‘Žπ‘‘π‘Ž|π‘šπ‘œπ‘‘π‘’π‘™) – Likelihood
𝑃(π‘šπ‘œπ‘‘π‘’π‘™) – Prior probability
𝑃(π‘‘π‘Žπ‘‘π‘Ž) – Marginal probability
• Likelihood versus probability: P(n|u, N)
– for fixed u, P defines a probability over n;
– for fixed n, P defines the likelihood of u.
• Never say “the likelihood of the data”
• Always say “the likelihood of the parameters given the
data”
Maximum Likelihood Estimation (MLE)
• θ: model; X: data
• θ* = argmax_θ P(θ|X) = argmax_θ P(X|θ) P(θ) / P(X)
– Assume a uniform prior: P(θ) = constant
– P(X) is independent of θ, and is dropped
• θ* = argmax_θ P(θ|X) ≈ argmax_θ P(X|θ)
– where P(X|θ) is the likelihood of the parameters
• Key difference between MLE and Bayesian estimation
– MLE assumes that θ is fixed but unknown
– Bayesian estimation assumes that θ itself is a random variable with a prior distribution P(θ)
MLE for Trigram LM
• P_MLE(w3|w1 w2) = Count(w1 w2 w3) / Count(w1 w2)
• P_MLE(w2|w1) = Count(w1 w2) / Count(w1)
• P_MLE(w) = Count(w) / N
• It is easy – let us get some real text and start to count
• But why is this the MLE solution?
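For concreteness, a small Python sketch of the counting recipe above (my own illustration, using a made-up two-sentence corpus and a simple two-token <s> padding convention):

    from collections import Counter

    # Tiny made-up corpus; in practice this would be a large text collection.
    corpus = [
        "the dog of our neighbor barks",
        "the dog barks",
    ]

    tri, bi = Counter(), Counter()
    for line in corpus:
        words = ["<s>", "<s>"] + line.split() + ["</s>"]   # pad so every word has two predecessors
        for i in range(2, len(words)):
            tri[tuple(words[i-2:i+1])] += 1
            bi[tuple(words[i-2:i])] += 1

    def p_mle(w3, w1, w2):
        """MLE trigram probability: Count(w1 w2 w3) / Count(w1 w2)."""
        denom = bi[(w1, w2)]
        return tri[(w1, w2, w3)] / denom if denom else 0.0

    print(p_mle("barks", "the", "dog"))   # 0.5 on this toy corpus ("the dog barks" vs "the dog of")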
Derivation of MLE for N-gram
• Homework – an MSR interview question
• Hints
– This is a constrained optimization problem
– Use log likelihood as objective function
– Assume a multinomial distribution of LM
– Introduce Lagrange multiplier for the constraints
• Σ_{x∈X} P(x) = 1, and P(x) ≥ 0
Sparse Data Problem
• Say our vocabulary size is |V|
• There are |V|^3 parameters in the trigram LM
– |V| = 20,000 ⇒ 20,000^3 = 8 × 10^12 parameters
• Most trigrams have a zero count even in a large
text corpus
– Count(w1 w2 w3) = 0
– P_MLE(w3|w1 w2) = Count(w1 w2 w3) / Count(w1 w2) = 0
– P(W) = P_MLE(w1) P_MLE(w2|w1) ∏_{i=3…n} P(wi|wi−2 wi−1) = 0
– oops…
Smoothing: Adding One
• Add one smoothing (from Bayesian paradigm)
• But works very badly – do not use this
• Add delta smoothing
• Still very bad – do not use this
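For reference only (the formulas on this slide appear as figures in the original deck): add-one smoothing is usually written as P(w3|w1 w2) = (Count(w1 w2 w3) + 1) / (Count(w1 w2) + |V|), and add-delta replaces the 1 with a small δ and the |V| with δ|V|, where |V| is the vocabulary size.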
Smoothing: Backoff
• Backoff trigram to bigram, bigram to unigram
• D ∈ (0,1) is a discount constant – absolute discount
• α is calculated so probabilities sum to 1 (homework)
• Simple and effective – use this one!
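The backoff formulas are figures in the original slides; as a rough illustration only, here is a Python sketch of absolute-discount backoff for just the bigram-to-unigram step, assuming the standard form P(w2|w1) = max(Count(w1 w2) − D, 0)/Count(w1) + α(w1) P(w2), with α(w1) chosen so the distribution sums to one. The corpus and D value are invented.

    from collections import Counter

    def make_backoff_bigram(tokens, D=0.7):
        """Absolute-discount backoff: discounted bigram MLE, with the reserved
        mass redistributed over the unigram distribution (a sketch, not tuned)."""
        uni = Counter(tokens)
        bi = Counter(zip(tokens, tokens[1:]))
        N = len(tokens)

        def p_uni(w):
            return uni[w] / N

        def alpha(w1):
            # Mass removed by discounting: D times the number of distinct continuations of w1
            seen = [w2 for (a, w2) in bi if a == w1]
            reserved = D * len(seen) / uni[w1]
            # Normalize over the unigram mass of the unseen continuations
            return reserved / (1.0 - sum(p_uni(w2) for w2 in seen))

        def p(w2, w1):
            c = bi[(w1, w2)]
            if c > 0:
                return max(c - D, 0.0) / uni[w1]
            return alpha(w1) * p_uni(w2)

        return p

    p = make_backoff_bigram("the dog barks and the dog sleeps".split())
    print(p("barks", "dog"), p("sleeps", "the"))  # seen bigram vs. backed-off estimate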
Outline
• Probability
• SMT and translation models
• SMT for web search ranking
SMT
C: 救援 δΊΊε‘˜ 在 ε€’ε‘Œηš„ ζˆΏε±‹ ι‡Œ ε―»ζ‰Ύ η”ŸθΏ˜θ€…
E: Rescue workers search for survivors in collapsed houses
E* = argmax_E P(E|C)
E* = argmax_E P(C|E) P(E)    (via Bayes rule)
P(C|E) and P(E|C): translation models
P(E|C) = (1/Z(C, E)) exp(Σ_i λ_i h_i(C, E))    (log-linear model with features h_i and weights λ_i)
P(E|C)
• Translation process (generative story)
– C is broken into translation units
– Each unit is translated into English
– Glue translated units to form E
• Translation models
– Word-based models
– Phrase-based models
– Syntax-based models
Generative Modeling
• Story (Art)
• Math (Science)
• Code (Engineering)
Generative Modeling for P(E|C)
• Story making
– how a target sentence is generated from a source
sentence step by step
• Mathematical formulation
– modeling each generation step in the generative story using a probability distribution
• Parameter estimation
– implementing an effective way of estimating the
probability distributions from training data
Word-Based Models: IBM Model 1
• We first choose the length for the target sentence, I, according to the distribution P(I|C).
• Then, for each position i (i = 1 … I) in the target sentence, we choose a position j in the source sentence from which to generate the i-th target word e_i, according to the distribution P(j|C).
• Finally, we generate the target word by translating c_j according to the distribution P(e_i|c_j).
Mathematical Formulation
• Assume that the choice of the length is independent of C and I
– P(I|C) = ε
• Assume that all positions in the source sentence are equally likely to be chosen
– P(j|C) = 1 / (J + 1)
• Assume that each target word is generated independently given C
– P(E|C) = P(I|C) ∏_{i=1}^{I} P(e_i|C)
Parameter Estimation
• Model form
– P(E|C) = ε / (J+1)^I · ∏_{i=1}^{I} Σ_{j=0}^{J} P(e_i|c_j)
• MLE on word-aligned training data
– P(e|c) = N(c, e) / Σ_{e'} N(c, e')
• Don’t forget smoothing
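A minimal sketch (my own, not from the slides) of the MLE step above: given word-aligned sentence pairs, count the aligned (c, e) links and normalize per source word. The toy alignment data below is invented.

    from collections import Counter

    # Invented word-aligned pairs: (chinese_words, english_words, alignment links (j, i))
    aligned = [
        (["ε―»ζ‰Ύ", "η”ŸθΏ˜θ€…"], ["search", "for", "survivors"], [(0, 0), (0, 1), (1, 2)]),
        (["ε―»ζ‰Ύ", "ζˆΏε±‹"],   ["search", "for", "houses"],    [(0, 0), (0, 1), (1, 2)]),
    ]

    link_count = Counter()                 # N(c, e)
    src_count = Counter()                  # sum over e' of N(c, e')
    for c_sent, e_sent, links in aligned:
        for j, i in links:
            link_count[(c_sent[j], e_sent[i])] += 1
            src_count[c_sent[j]] += 1

    def p_e_given_c(e, c):
        """MLE translation probability P(e|c) = N(c, e) / sum_e' N(c, e')."""
        return link_count[(c, e)] / src_count[c] if src_count[c] else 0.0

    print(p_e_given_c("search", "ε―»ζ‰Ύ"))   # 0.5: "ε―»ζ‰Ύ" is linked equally often to "search" and "for"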
Phrase-Based Models
Mathematical Formulation
• Assume a uniform probability over segmentations
– P(E|C) ∝ Σ_{(S,T,M)∈B(C,E)} P(T|C, S) · P(M|C, S, T)
• Use the maximum approximation to the sum
– P(E|C) ≈ max_{(S,T,M)∈B(C,E)} P(T|C, S) · P(M|C, S, T)
• Assume each phrase is translated independently and use a distance-based reordering model
– P(E|C) ∝ max_{(S,T,M)∈B(C,E)} ∏_{k=1}^{K} P(e_k|c_k) · d(start_i − end_{i−1} − 1)
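As an illustration only (my own sketch, not the actual decoder), this is how one derivation (S, T, M) would be scored under the last line: a product of phrase translation probabilities times a distance-based reordering penalty. The phrase table, the exponential penalty shape, the positions, and the example phrases are all assumptions.

    # Assumed phrase translation probabilities P(e_k | c_k)
    phrase_table = {
        ("ε€’ε‘Œ ηš„ ζˆΏε±‹", "collapsed houses"): 0.6,
        ("ε―»ζ‰Ύ η”ŸθΏ˜θ€…", "search for survivors"): 0.5,
    }

    def reorder_penalty(jump, alpha=0.9):
        """Distance-based reordering d(start_i - end_{i-1} - 1), here an exponential penalty."""
        return alpha ** abs(jump)

    def score_derivation(steps):
        """steps: list of (c_phrase, e_phrase, start, end) in target order;
        start/end are 1-based source-side word positions of each phrase."""
        score, prev_end = 1.0, 0
        for c_phrase, e_phrase, start, end in steps:
            score *= phrase_table[(c_phrase, e_phrase)]
            score *= reorder_penalty(start - prev_end - 1)
            prev_end = end
        return score

    # Toy derivation for a two-phrase source: the phrase at positions 3-4 is translated
    # first and the one at positions 1-2 second, so both jumps are penalized.
    steps = [("ε―»ζ‰Ύ η”ŸθΏ˜θ€…", "search for survivors", 3, 4),
             ("ε€’ε‘Œ ηš„ ζˆΏε±‹", "collapsed houses", 1, 2)]
    print(score_derivation(steps))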
Parameter Estimation
• MLE: P(e|c) = N(c, e) / Σ_{e'} N(c, e')
• Don't forget smoothing
Syntax-Based Models
Story
• Parse an input Chinese sentence into a parse
tree
• Translate each Chinese constituent into
English
– VP → (PP ε―»ζ‰Ύ NP, search for NP PP)
• Glue these English constituents into a well-formed English sentence.
Other Two Tasks?
• Mathematical formulation
– Based on synchronous context-free grammar (SCFG)
• Parameter estimation
– Learning SCFG from data
• Homework
• Let us go through an example (thanks to Michel Galley)
– Hierarchical phrase model
– Linguistically syntax-based models
Hierarchical phrase model: example
• Word-aligned pair: 救援 δΊΊε‘˜ 在 ε€’ε‘Œ ηš„ ζˆΏε±‹ ι‡Œ ε―»ζ‰Ύ η”ŸθΏ˜θ€… ↔ rescue workers search for survivors in collapsed houses
• Aligned phrase pairs highlighted in the figure:
– ε€’ε‘Œ ηš„ ζˆΏε±‹ ↔ collapsed houses
– 在 ε€’ε‘Œ ηš„ ζˆΏε±‹ ι‡Œ ε―»ζ‰Ύ η”ŸθΏ˜θ€… ↔ search for survivors in collapsed houses
A synchronous rule
• 在 X1 ι‡Œ ε―»ζ‰Ύ X2 ↔ search for X2 in X1
• Phrase-based translation unit
• Discontinuous translation unit
• Control on reordering

A synchronous grammar
• Rules:
– 在 X1 ι‡Œ ε―»ζ‰Ύ X2 ↔ search for X2 in X1
– ε€’ε‘Œ ηš„ ζˆΏε±‹ ↔ collapsed houses
– η”ŸθΏ˜θ€… ↔ survivors
• Context-free derivation:
– 在 X1 ι‡Œ ε―»ζ‰Ύ X2 / search for X2 in X1
– ⇒ 在 ε€’ε‘Œ ηš„ ζˆΏε±‹ ι‡Œ ε―»ζ‰Ύ X2 / search for X2 in collapsed houses
– ⇒ 在 ε€’ε‘Œ ηš„ ζˆΏε±‹ ι‡Œ ε―»ζ‰Ύ η”ŸθΏ˜θ€… / search for survivors in collapsed houses

A synchronous grammar
• Recognizes:
– search for survivors in collapsed houses
– search for collapsed houses in survivors
– search for survivors collapsed houses in
Linguistically syntax-based models: rule extraction example
• Word-aligned pair with an English parse: 救援 δΊΊε‘˜ 在 ε€’ε‘Œ ηš„ ζˆΏε±‹ ι‡Œ ε―»ζ‰Ύ η”ŸθΏ˜θ€… ↔ Rescue workers search for survivors in collapsed houses.
• Word-level glosses: 救援 (rescue), δΊΊε‘˜ (staff), 在 (in), ε€’ε‘Œ ηš„ (collapse of), ζˆΏε±‹ (house), ι‡Œ (in), ε―»ζ‰Ύ (search), η”ŸθΏ˜θ€… (survivors)
• The English parse uses POS tags (NNS, JJ, NN, VBP, IN) and constituents (NP, PP, VP, S)
• Extracted rule: VP → (PP ε―»ζ‰Ύ NP, VBP(search) IN(for) NP PP)
• SCFG rule: VP-234 → (PP-32 ε―»ζ‰Ύ NP-57, search for NP-57 PP-32)
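To make the synchronous derivation idea concrete, here is a small Python sketch (mine, with a toy grammar modeled loosely on the example above) that expands a synchronous rule on the Chinese and English sides in lockstep; the rule set is illustrative only, and random.choice means the output varies from run to run.

    import random

    # Toy synchronous grammar: nonterminal -> list of (source_rhs, target_rhs).
    # Tuples like ("NP", 1) are linked nonterminal slots; reordering is expressed
    # directly by where slot 1 and slot 2 appear on each side of the rule.
    GRAMMAR = {
        "VP": [
            (("在", ("NP", 1), "ι‡Œ", "ε―»ζ‰Ύ", ("NP", 2)),
             ("search", "for", ("NP", 2), "in", ("NP", 1))),
        ],
        "NP": [
            (("ε€’ε‘Œ ηš„ ζˆΏε±‹",), ("collapsed houses",)),
            (("η”ŸθΏ˜θ€…",), ("survivors",)),
        ],
    }

    def derive(symbol):
        """Synchronously expand `symbol`, returning (chinese, english) word lists."""
        src_rhs, tgt_rhs = random.choice(GRAMMAR[symbol])
        slots, src_out = {}, []
        for item in src_rhs:
            if isinstance(item, tuple):              # a linked nonterminal slot
                nt, idx = item
                slots[idx] = derive(nt)              # expand once, reuse on both sides
                src_out.extend(slots[idx][0])
            else:
                src_out.append(item)                 # a source terminal
        tgt_out = []
        for item in tgt_rhs:
            if isinstance(item, tuple):
                tgt_out.extend(slots[item[1]][1])    # splice in the English side of the slot
            else:
                tgt_out.append(item)                 # a target terminal
        return src_out, tgt_out

    print(derive("VP"))
    # may produce, e.g.:
    # (['在', 'ε€’ε‘Œ ηš„ ζˆΏε±‹', 'ι‡Œ', 'ε―»ζ‰Ύ', 'η”ŸθΏ˜θ€…'],
    #  ['search', 'for', 'survivors', 'in', 'collapsed houses'])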
Outline
• Probability
• SMT and translation models
• SMT for web search ranking
Web Documents and Search Queries
• cold home remedy
• cold remeedy
• flu treatment
• how to deal with stuffy nose?
Map Queries to Documents
• Fuzzy keyword matching
– Q: cold home remedy
– D: best home remedies for cold and flu
• Spelling correction
– Q: cold remeedies
– D: best home remedies for cold and flu
• Query alteration
– Q: flu treatment
– D: best home remedies for cold and flu
• Query/document rewriting
– Q: how to deal with stuffy nose
– D: best home remedies for cold and flu
• Where are we now?
Research Agenda (Gao et al. 2010, 2011)
• Model documents and queries as different languages
(Gao et al., 2010)
• Cast mapping queries to documents as bridging the
language gap via translation
• Leverage statistical machine translation (SMT)
technologies and infrastructures to improve search
relevance
Are Queries and Docs just Different
Languages?
• A large scale analysis, extending (Huang et al.
2010)
• Divide web collection into different fields, e.g.,
queries, anchor text, titles, etc.
• Develop a set of language models, each trained on an n-gram dataset from a different field
• Measure language difference between
different fields (queries/docs) via perplexity
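For reference, a small Python sketch (my own illustration) of how such a comparison works: the perplexity of a test set under a language model is 2 raised to the average negative log2 probability per token, and lower perplexity means the model predicts the field better. The unigram table below is just a stand-in; in the study the probabilities come from the Web n-gram models.

    import math

    def perplexity(test_sentences, log2_prob):
        """Perplexity = 2^(-average log2 P(w | history)) over all test tokens."""
        total_logp, n_tokens = 0.0, 0
        for sent in test_sentences:
            words = sent.split()
            for i, w in enumerate(words):
                total_logp += log2_prob(w, words[:i])   # history = preceding words
                n_tokens += 1
        return 2 ** (-total_logp / n_tokens)

    # Placeholder "model": a unigram table with a small floor for unseen words.
    unigram = {"home": 0.02, "remedies": 0.01, "cold": 0.02, "for": 0.05}
    lm = lambda w, hist: math.log2(unigram.get(w, 1e-6))

    print(perplexity(["cold home remedies", "home remedies for cold"], lm))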
Microsoft Web N-gram Model
Collection (cutoff = 0)
• Microsoft web n-gram services.
http://research.microsoft.com/web-ngram
Perplexity Results
• Test set
– 733,147 queries from the May 2009 query log
• Summary
– Query LM is most predictive of test queries
– Title is better than Anchor at lower orders but worse at higher orders
– Body is in a different league
SMT for Document Ranking
• Given a query (q), doc (d) can be ranked by how likely it is that q is rewritten from d, P(q|d)
how to deal with
stuffy nose?
• An example: phrasal statistical translation for
Web document ranking
Phrasal Statistical Translation for Ranking
d: "cold home remedies"            (title)
S: ["cold", "home remedies"]       (segmentation)
T: ["stuffy nose", "deal with"]    (translation)
M: (1 → 2, 2 → 1)                  (permutation)
q: "deal with stuffy nose"         (query)

• Uniform probability over S: P(q|d) ≈ Σ_{(S,T,M)} P(T|d, S) P(M|d, S, T)
• Maximum approximation: P(q|d) ≈ max_{(S,T,M)∈B(d,q)} P(T|d, S) P(M|d, S, T)
• Max probability assignment via dynamic programming: P(q|d) ≈ max_{(S,T,M)∈B(d,q,A*)} P(T|d, S), where P(T|d, S) = ∏_{k=1…K} P(q_k|w_k)
• Model training on query-doc pairs
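As a simplified illustration (my own sketch, not the exact procedure in the paper, which also handles permutations via the alignment A*), the code below scores a query against a title by dynamic programming over query segmentations, taking the best title phrase as the source of each query phrase; the phrase table is invented.

    # Assumed phrase translation table P(query_phrase | title_phrase), learned elsewhere.
    PHRASE_TABLE = {
        ("cold", "stuffy nose"): 0.3,
        ("home remedies", "deal with"): 0.2,
        ("home remedies", "treatment"): 0.4,
    }

    def best_source_prob(q_phrase, title_phrases):
        """Best P(q_phrase | w) over all candidate title phrases w (0 if none)."""
        return max((PHRASE_TABLE.get((w, q_phrase), 0.0) for w in title_phrases), default=0.0)

    def rank_score(query, title_phrases, max_len=3):
        """Monotone DP: best[i] = max product of phrase probs covering the first i query words."""
        q = query.split()
        best = [0.0] * (len(q) + 1)
        best[0] = 1.0
        for i in range(1, len(q) + 1):
            for k in range(1, min(max_len, i) + 1):
                q_phrase = " ".join(q[i - k:i])
                cand = best[i - k] * best_source_prob(q_phrase, title_phrases)
                best[i] = max(best[i], cand)
        return best[-1]

    print(rank_score("deal with stuffy nose", ["cold", "home remedies"]))  # 0.2 * 0.3 = 0.06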
Mine Query-Document Pairs from User Logs
how to deal with stuffy nose?
stuffy nose treatment
cold home remedies
NO CLICK
NO CLICK
http://www.agelessherbs.com/BestHomeRemediesColdFlu.html
Mine Query-Document Pairs from User Logs
QUERY (Q)                              TITLE (T)
how to deal with stuffy nose           best home remedies for cold and flu
stuffy nose treatment                  best home remedies for cold and flu
cold home remedies                     best home remedies for cold and flu
……                                     ……
go israel                              forums goisrael community
skate at wholesale at pr               wholesale skates southeastern skate supply
breastfeeding nursing blister baby     clogged milk ducts babycenter
thank you teacher song                 lyrics for teaching educational children s music
immigration canada lacolle             cbsa office detailed information

• 178 million pairs from 0.5 year of logs
Evaluation Methodology
• Measurement: NDCG, t-test
• Test set:
– 12,071 English queries sampled from a 1-year log
– 5-level relevance label for each query-doc pair
– On a tail document set (the click field is empty)
• Training data for translation models:
– 82,834,648 query-title pairs
Baseline: Word-Based Models
(Berger & Lafferty, 1999)
• Basic model:
• Mixture model:
• Learning translation probabilities from
clickthrough data
– IBM Model 1 with EM
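A compact Python sketch of IBM Model 1 EM as it could be run on (query, title) pairs; this is my own minimal version (no NULL word, no smoothing), with invented training pairs, not the production training code.

    from collections import defaultdict

    # Invented (query, title) pairs; titles act as the "source" side, queries as the "target".
    pairs = [("cold remedy", "cold remedies"),
             ("cold treatment", "cold cure"),
             ("flu treatment", "flu cure")]
    pairs = [(q.split(), w.split()) for q, w in pairs]

    # Uniform start: any constant works, since the E-step renormalizes per query word.
    t = defaultdict(float)
    for q_words, w_words in pairs:
        for q in q_words:
            for w in w_words:
                t[(q, w)] = 1.0

    for _ in range(20):                                # EM iterations
        count, total = defaultdict(float), defaultdict(float)
        for q_words, w_words in pairs:
            for q in q_words:
                z = sum(t[(q, w)] for w in w_words)    # E-step: posterior over title words
                for w in w_words:
                    c = t[(q, w)] / z
                    count[(q, w)] += c                 # expected alignment counts
                    total[w] += c
        for (q, w), c in count.items():                # M-step: t(q|w) = count / total
            t[(q, w)] = c / total[w]

    print(round(t[("treatment", "cure")], 2), round(t[("cold", "cure")], 2))
    # "cure" ends up far more likely to translate to "treatment" than to "cold"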
Results
• Sample IBM-1 word translation probabilities after EM training on the query-title pairs
Bilingual Phrases
• Notice that with context information, we have
less ambiguous translations
Results
• Ranking results
– All features
– Only phrase translation features
Why Do Bi-Phrases Help?
• Length distribution
• Good/bad examples
Generative Topic Models
Q: stuffy nose treatment ← Topic → D: cold home remedies
• Probabilistic latent Semantic Analysis (PLSA)
– P(q|d) = ∏_{term q in the query} Σ_z P(q|φ_z) P(z|d, θ)
– d is assigned a single most likely topic vector
– q is generated from the topic vectors
• Latent Dirichlet Allocation (LDA) generalizes PLSA
– a posterior distribution over topic vectors is used
– PLSA = LDA with MAP inference
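To illustrate the scoring rule (my own sketch; the topic matrices below are invented rather than learned), P(q|d) is a product over query terms, each marginalized over topics:

    import math

    # Assumed parameters of a 2-topic model:
    # phi[z][word] = P(word | topic z), theta_d[z] = P(z | d) for one document d.
    phi = [
        {"nose": 0.3, "stuffy": 0.3, "cold": 0.2, "remedies": 0.2},   # topic 0: symptoms
        {"treatment": 0.4, "remedies": 0.4, "home": 0.2},             # topic 1: treatments
    ]
    theta_d = [0.6, 0.4]

    def log_p_query_given_doc(query):
        """log P(q|d) = sum over query terms of log sum_z P(term|phi_z) P(z|theta_d)."""
        logp = 0.0
        for term in query.split():
            p_term = sum(theta_d[z] * phi[z].get(term, 1e-6) for z in range(len(theta_d)))
            logp += math.log(p_term)
        return logp

    print(log_p_query_given_doc("stuffy nose treatment"))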
Bilingual Topic Model
• For each topic z: φ_z^q, φ_z^d ~ Dir(β)
• For each q-d pair: θ ~ Dir(α)
• Each query word q is generated by z ~ θ and q ~ φ_z^q
• Each document word w is generated by z ~ θ and w ~ φ_z^d
Log-likelihood of LDA Given Data
• φ and θ: distributions over distributions
• LDA requires integrating over φ and θ
• This is the MAP approximation to LDA
MAP Estimation via EM
• Estimate (θ, φ^q, φ^d) by maximizing the joint log likelihood of the q-d pairs and the parameters
• E-Step: compute posterior probabilities
– P(z|q, θ_{q,d}), P(z|w, θ_{q,d})
• M-Step: update parameters using the posterior probabilities
– P(q|φ_z^q), P(w|φ_z^d), P(z|θ_{q,d})
Posterior Regularization (PR)
• q and its clicked d are relevant, thus they
– Share same prior distribution over topics (MAP)
– Weight each topic similarly (PR)
• Model training via modified EM
– E-step: for each q-d pair, project the posterior
topic distributions onto a constrained set, where
the expected fraction of each topic is equal in q
and d
– M-step: update parameters using the projected
posterior probabilities
Topic Models for Doc Ranking
Evaluation Methodology
• Measurement: NDCG, t-test
• Test set:
– 16,510 English queries sampled from a 1-year log
– Each query is associated with 15 docs
– 5-level relevance label for each query-doc pair
• Training data for translation models:
– 82,834,648 query-title pairs
Topic Model Results
Summary
• Probability
– Basics
– A case study of a probabilistic model: N-gram language model
• Statistical Machine Translation (SMT)
– Generative modeling (story → math → code)
– Word/phrase/syntax based models
• SMT for web search ranking
– View queries and docs as different languages
– Doc ranking via 𝑃(πͺ|𝐝)
– Word/phrase/topic based models
• Slides/doc will be available at
http://research.microsoft.com/~jfgao/
Main Reference
• Berger, A., and Lafferty, J. 1999. Information retrieval as statistical translation. In SIGIR, pp. 222-229.
• Gao, J., He, X., and Nie, J-Y. 2010. Clickthrough-based translation models for web search: from word models to phrase models. In CIKM, pp. 1139-1148.
• Gao, J., Toutanova, K., and Yih, W-T. 2011. Clickthrough-based latent semantic models for web search. In SIGIR.
• Huang, J., Gao, J., Miao, J., Li, X., Wang, K., and Behr, F. 2010. Exploring web scale language models for search query processing. In Proc. WWW 2010, pp. 451-460.
• MacKay, David J. C. 2003. Information Theory, Inference and Learning Algorithms. Cambridge: Cambridge University Press.
• Manning, C., and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
• Koehn, Philipp. 2009. Statistical Machine Translation. Cambridge University Press.