# Concepts Identification from Queries and Its Application for Search

Fuchun Peng
Microsoft Bing
7/23/2010
• A query is often treated as a bag of words
• But when people formulate queries, they use "concepts" as building blocks

Q: simmons college sports psychology
A1: "simmons college", "sports psychology"
A2: "college sports"

The intended reading is "simmons college's sports psychology (course)", so segmentation A1 recovers the right concepts while A2 crosses a concept boundary.

• Can we automatically segment the query to recover the concepts?


Summary of Segmentation approaches
Use for Improving Search Relevance
◦ Query rewriting
◦ Ranking features

Conclusions

• Supervised learning (Bergsma et al., EMNLP-CoNLL 2007)
◦ Binary decision at each possible segmentation point, e.g. decisions N, Y, N, Y at the four gaps of w1 … w5 give w1 w2 | w3 w4 | w5
◦ Features: POS tags, web counts, surrounding function words ("the", "and", …)
• Problems:
– Limited-range context
– Features specifically designed for noun phrases
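The gap-decision formulation above can be sketched as follows (a minimal illustration with a hypothetical helper name, not the Bergsma et al. classifier itself):

```python
def apply_boundary_decisions(words, decisions):
    """Turn per-gap Y/N decisions into segments.

    words:     query terms, e.g. ["w1", ..., "w5"]
    decisions: n-1 booleans, True = segment boundary in that gap."""
    segments, current = [], [words[0]]
    for word, boundary in zip(words[1:], decisions):
        if boundary:
            segments.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    segments.append(" ".join(current))
    return segments

print(apply_boundary_decisions(
    ["simmons", "college", "sports", "psychology"], [False, True, False]))
# -> ['simmons college', 'sports psychology']
```

The classifier's job is only to produce the boolean vector; everything downstream is this deterministic reassembly.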

• Manual data preparation
◦ Linguistically driven:
 [san jose international airport]
◦ Relevance driven:
 [san jose] [international airport]
• Mutual information (MI) between adjacent words:

MI(w1, w2) = P(w1 w2) / (P(w1) P(w2))

◦ Compute MI for each adjacent pair: (1,2), (2,3), (3,4), (4,5)
◦ Insert a segment boundary wherever MI falls below a threshold, e.g. w1 w2 | w3 w4 w5
◦ Iterative update
• Problems:
– Only captures short-range correlation (between adjacent words)
– What about "my heart will go on"?
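As an illustration of the MI thresholding step, here is a minimal sketch (the probabilities and threshold below are invented toy values):

```python
def mi(p_bigram, p_w1, p_w2):
    # MI as defined above (without the log): P(w1 w2) / (P(w1) * P(w2))
    return p_bigram / (p_w1 * p_w2)

def segment_by_mi(words, p_uni, p_bi, threshold):
    """Insert a boundary at every gap whose adjacent-word MI is below threshold."""
    segments, current = [], [words[0]]
    for prev, word in zip(words, words[1:]):
        score = mi(p_bi.get((prev, word), 1e-12), p_uni[prev], p_uni[word])
        if score < threshold:
            segments.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    segments.append(" ".join(current))
    return segments

# Toy distributions: "w1 w2" and "w3 w4" cohere strongly; the other gaps do not.
p_uni = {"w1": 0.01, "w2": 0.01, "w3": 0.02, "w4": 0.02, "w5": 0.01}
p_bi = {("w1", "w2"): 0.005, ("w3", "w4"): 0.008, ("w4", "w5"): 1e-7}
print(segment_by_mi(["w1", "w2", "w3", "w4", "w5"], p_uni, p_bi, threshold=10))
# -> ['w1 w2', 'w3 w4', 'w5']
```

Note how the decision at each gap uses only the two adjacent words, which is exactly the short-range limitation the slide points out.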

• Assume the query is generated by independent sampling from a probability distribution over concepts (a concept unigram model):

simmons college | sports psychology
P(simmons college) = 0.000016, P(sports psychology) = 0.000002
P = 0.000016 × 0.000002

simmons | college sports | psychology
P(simmons) = 0.000007, P(college sports) = 0.000006, P(psychology) = 0.000024
P = 0.000007 × 0.000006 × 0.000024

Since 0.000016 × 0.000002 > 0.000007 × 0.000006 × 0.000024, the first segmentation wins.

• Enumerate all possible segmentations; rank them by probability of being generated by the unigram model
• How do we estimate the parameters P(w) of the unigram model?
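The enumerate-and-rank step can be sketched directly, using the concept probabilities quoted above (any segmentation containing a concept outside this toy table scores zero here):

```python
from itertools import product

# Concept probabilities quoted on the slide.
P = {
    "simmons college": 0.000016, "sports psychology": 0.000002,
    "simmons": 0.000007, "college sports": 0.000006, "psychology": 0.000024,
}

def segmentations(words):
    # Each of the n-1 gaps independently gets a boundary or not.
    for gaps in product([True, False], repeat=len(words) - 1):
        segs, cur = [], [words[0]]
        for word, boundary in zip(words[1:], gaps):
            if boundary:
                segs.append(" ".join(cur))
                cur = [word]
            else:
                cur.append(word)
        segs.append(" ".join(cur))
        yield segs

def score(segs):
    p = 1.0
    for s in segs:
        p *= P.get(s, 0.0)  # unlisted concepts get probability 0 in this toy
    return p

best = max(segmentations("simmons college sports psychology".split()), key=score)
print(best)  # -> ['simmons college', 'sports psychology']
```

Brute-force enumeration is exponential in query length; in practice the same maximization is done with dynamic programming.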

• We have n-gram (n = 1..5) counts from a web corpus
◦ 464M documents; L = 33B tokens
◦ Approximate counts for longer n-grams are often computable: e.g. #(harry potter and the goblet of fire) is in [5783, 6399]
– #(ABC) = #(AB) + #(BC) − #(AB OR BC) ≥ #(AB) + #(BC) − #(B)
◦ Bounds for arbitrary lengths are solved by dynamic programming
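The overlap bound above extends recursively to arbitrary lengths; a small dynamic-programming sketch (the counts in the example are invented) might look like:

```python
def lower_bounds(words, count, max_n=5):
    """lb[(i, j)]: lower bound on the corpus count of the span words[i:j].
    Spans up to max_n words use the stored count; longer spans use
    #(ABC) >= #(AB) + #(BC) - #(B) over already-filled sub-spans."""
    n = len(words)
    lb = {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            if length <= max_n:
                lb[(i, j)] = count.get(" ".join(words[i:j]), 0)
            else:
                lb[(i, j)] = max(0, lb[(i, j - 1)] + lb[(i + 1, j)]
                                    - lb[(i + 1, j - 1)])
    return lb

# Invented counts, with max_n=2 so the 3-word span must be bounded:
counts = {"a": 10, "b": 8, "c": 9, "a b": 5, "b c": 4}
print(lower_bounds(["a", "b", "c"], counts, max_n=2)[(0, 3)])  # -> 1
```

Filling spans in order of increasing length guarantees every sub-span the recurrence needs is already available.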

• Maximum likelihood estimate:

PMLE(t) = #(t) / N

• Problem:
◦ #(potter and the goblet of) = 6765, so PMLE(potter and the goblet of) > PMLE(harry potter and the goblet of fire)? Wrong!
◦ What we want is not the probability of seeing t in text, but the probability of seeing t as a self-contained concept in text
Query-relevant web corpus:

| ngram | longest matching count | raw frequency |
| --- | --- | --- |
| harry | 1657108 | 2003112 |
| harry potter | 277736 | 346004 |
| harry potter and | 10436 | 68268 |
| harry potter and the | 51330 | 57832 |
| harry potter and the goblet | 101 | 6502 |
| harry potter and the goblet of | 618 | 6401 |
| harry potter and the goblet of fire | 5783 | 5783 |
| … | … | … |
| fire | 4200957 | 4478774 |
Choose parameters to maximize the posterior probability given the query-relevant corpus (equivalently, minimize the total description length):

- t: a query substring
- C(t): longest matching count of t
- D = {(t, C(t))}: the query-relevant corpus
- s(t): a segmentation of t
- θ: unigram model parameters (n-gram probabilities)

θ* = argmaxθ P(D|θ) P(θ) = argmaxθ [ log P(D|θ) + log P(θ) ]

(the first term corresponds to the description length of the corpus, the second to that of the parameters)

log P(D|θ) = Σt C(t) log P(t|θ)
P(t|θ) = Σs(t) P(s(t)|θ)
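One simple way to realize this estimation is an iterative loop that alternates segmenting and recounting. The sketch below is a hard-assignment (Viterbi-style) simplification under my own assumptions: it omits the prior/description-length term on θ and uses the single best segmentation rather than the full sum over s(t).

```python
def viterbi_segment(words, theta, max_len=5):
    """Highest-probability segmentation of words under unigram model theta."""
    n = len(words)
    best = [(1.0, [])] + [(0.0, None)] * n  # best[j]: (prob, segs) over words[:j]
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            seg = " ".join(words[i:j])
            p = best[i][0] * theta.get(seg, 0.0)
            if p > best[j][0]:
                best[j] = (p, best[i][1] + [seg])
    return best[n][1]

def reestimate(corpus, theta, iterations=5):
    """corpus: (word_list, longest_matching_count) pairs, i.e. D = {(t, C(t))}.
    Re-segment with the current theta, recount concepts, renormalize."""
    for _ in range(iterations):
        counts = {}
        for words, c in corpus:
            for seg in viterbi_segment(words, theta) or []:
                counts[seg] = counts.get(seg, 0) + c
        total = sum(counts.values())
        if not total:
            return theta  # nothing segmentable under the current model
        theta = {seg: k / total for seg, k in counts.items()}
    return theta
```

Each iteration reassigns probability mass toward concepts that survive as whole segments, which is the intuition behind preferring longest-matching counts over raw frequencies.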

• Three human-segmented data sets: training, validation, and test, with 500 queries each
◦ Segmented by three editors A, B, C

• Evaluation metrics:
◦ Boundary classification accuracy: the fraction of correct Y/N decisions at word gaps (e.g. the four gaps of w1 … w5)
◦ Whole query accuracy: the percentage of queries with perfect boundary classification accuracy
◦ Segment accuracy: the percentage of segments correctly recovered
– Truth: [abc] [de] [fg]
– Prediction: [abc] [de fg] → segment precision 1/2, recall 1/3
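These metrics are straightforward to compute once segmentations are represented as token spans; a sketch (the helper names are mine, not from the talk):

```python
def spans(segments):
    """Convert ['a b c', 'd e'] into token spans [(0, 3), (3, 5)]."""
    out, pos = [], 0
    for seg in segments:
        n = len(seg.split())
        out.append((pos, pos + n))
        pos += n
    return out

def boundary_accuracy(truth, pred):
    """Fraction of word gaps whose Y/N boundary decision is correct."""
    n_gaps = sum(len(s.split()) for s in truth) - 1
    t = {end for _, end in spans(truth)[:-1]}
    p = {end for _, end in spans(pred)[:-1]}
    return (n_gaps - len(t ^ p)) / n_gaps  # symmetric diff = wrong decisions

def segment_precision_recall(truth, pred):
    """A predicted segment counts only if its exact span appears in truth."""
    t, p = set(spans(truth)), set(spans(pred))
    hit = len(t & p)
    return hit / len(p), hit / len(t)

truth = ["a b c", "d e", "f g"]
pred = ["a b c", "d e f g"]
print(boundary_accuracy(truth, pred))         # 5 of 6 gap decisions correct
print(segment_precision_recall(truth, pred))  # precision 1/2, recall 1/3
```

Whole query accuracy is then just the fraction of queries where `boundary_accuracy` equals 1.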


Phrase Proximity Boosting
Phrase Level Query Expansion

• Classify each segment into one of three categories:
◦ Strong concept: no word reordering, no word insertion/deletion
– Treat the whole segment as a single unit in matching and ranking
◦ Weak concept: word reordering or insertion/deletion allowed
– Boost documents matching the weak concepts
◦ Not a concept
– Do nothing

• Concept-based BM25
◦ Weighted by the confidence of the concepts
• Concept-based min coverage
◦ Weighted by the confidence of the concepts
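A possible shape for confidence-weighted, concept-based BM25 (the BM25 formula and parameter values below are the standard ones, not anything specified in the talk; the function and argument names are mine):

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    # Standard BM25 term contribution with a common idf variant.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def concept_bm25(concepts, doc_stats, n_docs, doc_len, avg_len):
    """concepts: (phrase, confidence) pairs from the segmenter.
    doc_stats: phrase -> (tf in this document, df in the collection).
    Each concept is matched as a single unit and its BM25 contribution
    is scaled by the segmenter's confidence in that concept."""
    score = 0.0
    for phrase, confidence in concepts:
        tf, df = doc_stats.get(phrase, (0, 1))
        score += confidence * bm25_term(tf, df, n_docs, doc_len, avg_len)
    return score
```

The key design point is that term statistics (tf, df) are collected for the whole phrase, so a document matching "sports psychology" as a unit scores higher than one containing "sports" and "psychology" far apart.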

• Phrase level replacement
◦ [san francisco] → [sf]
◦ [red eye flight] → [late night flight]

• Significant relevance improvement
◦ Affects 40% of query traffic
◦ Significant DCG gain (1.5% on affected queries)
◦ Significant online CTR gain (0.5% overall)




Data preparation is important for query segmentation
Phrases are important for improving relevance




- Bergsma et al., EMNLP-CoNLL 2007
- Risvik et al., WWW 2003
- Hagen et al., SIGIR 2010
- Tan & Peng, WWW 2008

• Solution 1: segment the web corpus offline, then collect counts of n-grams occurring as segments

| Harry Potter and the Goblet of Fire | is | the | fourth | novel | in | the | Harry Potter series | written by | J.K. Rowling |

harry potter and the goblet of fire += 1
potter and the goblet of += 0

◦ C. G. de Marcken, Unsupervised Language Acquisition, 1996
◦ Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA 2001
• Problem: technical difficulties

• Solution 2: online computation: only consider the parts of the web corpus that overlap with the query (longest matches)

Q = harry potter and the goblet of fire

"Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling"

harry potter and the goblet of fire += 1
the += 2
harry potter += 1

• Solution 2 on a different query:

Q = potter and the goblet

"Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling"

potter and the goblet += 1
the += 2
potter += 1

Directly computing longest matching counts from raw n-gram frequencies takes O(|Q|²) operations per query.
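One plausible reading of the O(|Q|²) computation: an occurrence of a query substring t is a longest match iff it cannot be extended by the neighboring query word on either side, which gives an inclusion-exclusion formula over raw frequencies. This is my reconstruction, not a method spelled out on the slides:

```python
def longest_matching_counts(words, freq):
    """freq: raw corpus frequency of each n-gram string.
    Returns {(i, j): longest-matching count of words[i:j]}.
    O(|Q|^2) substrings, O(1) dictionary lookups per substring."""
    n = len(words)

    def f(i, j):
        if i < 0 or j > n:
            return 0
        return freq.get(" ".join(words[i:j]), 0)

    lmc = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            # occurrences of words[i:j], minus those extendable left or
            # right by the adjacent query word (inclusion-exclusion)
            lmc[(i, j)] = f(i, j) - f(i - 1, j) - f(i, j + 1) + f(i - 1, j + 1)
    return lmc

# Toy frequencies: "a" occurs 10 times, 3 of them inside "a b".
print(longest_matching_counts(["a", "b"], {"a": 10, "b": 8, "a b": 3}))
# -> {(0, 1): 7, (0, 2): 3, (1, 2): 5}
```

Under this reading, only n-gram frequencies for query substrings (plus their one-word extensions) are needed, so the whole table is built with a quadratic number of lookups.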