Japanese Abbreviation Expansion with Query and Clickthrough Logs

Kei Uchiumi†, Mamoru Komachi‡, Keigo Machinaga,
Toshiyuki Maezawa†, Toshinori Satou†, Yoshinori Kobayashi†
† : Yahoo Japan Corporation
‡ : Nara Institute of Science and Technology
1
Query expansion improves recall
for search engines
“cod”
“Call of Duty”
2
Once: Using handmade dictionary

Lexicographers detected pairs of queries
and expansions
3
Recently : Hard to compile manually



Time consuming to construct a dictionary
Requires domain knowledge
The web grows rapidly

Even harder to maintain an up-to-date dictionary
4
Our purpose:
Generating an abbreviation dictionary
from web search logs

• Clickthrough logs
  • Learning semantic categories [Komachi et al. 2009]
  • Named entity extraction [Jain et al. 2010]
• Search query logs
  • Query alteration [Hagiwara et al. 2009]
  • Acquiring semantic categories [Sekine et al. 2007]
Excellent resource for many NLP applications in web domain
5
The main contribution
1. Novel re-ranking method to combine web query and clickthrough logs
2. First attempt to automatically recognize full spellings given Japanese abbreviations
This method is used as an assistant tool for making dictionaries at Yahoo! Japan
6
Agenda
1. Introduction
2. Query reformulation based on noisy channel model
  1. Query Abbreviation model
  2. Query Language Model
  3. Evaluation
  4. Related work
  5. Conclusion
7
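Noisy Channel Model
for query reformulation
\[
c^{*} = \operatorname*{argmax}_{c} P(c \mid q)
      = \operatorname*{argmax}_{c} \frac{P(c)\,P(q \mid c)}{P(q)}
      = \operatorname*{argmax}_{c} P(c)\,P(q \mid c)
\]
q : query, c : correct query
P(q | c) : Query Abbreviation Model
P(c) : Query Language Model
9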
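The argmax above can be read as a reranking step: each candidate c is scored by log P(c) + log P(q | c) and the candidates are sorted by that score. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `rerank`, the dictionary-based models, and all probabilities are hypothetical values invented for illustration, using the "cod" example from the introduction.

```python
import math


def rerank(candidates, abbreviation_model, language_model):
    """Sort candidates c by log P(c) + log P(q | c), highest first.

    abbreviation_model maps c -> P(q | c) for one fixed input query q.
    language_model maps c -> P(c).
    """
    scored = []
    for c in candidates:
        p_q_given_c = abbreviation_model.get(c, 0.0)
        p_c = language_model.get(c, 0.0)
        if p_q_given_c <= 0.0 or p_c <= 0.0:
            continue  # zero probability under either model: drop the candidate
        scored.append((math.log(p_c) + math.log(p_q_given_c), c))
    scored.sort(reverse=True)
    return [c for _, c in scored]


# Toy probabilities for the input query "cod" (invented, not from the logs).
abbreviation_model = {"call of duty": 0.6, "cod fish": 0.3, "cod": 0.1}
language_model = {"call of duty": 0.02, "cod fish": 0.001, "cod": 0.05}

ranking = rerank(list(abbreviation_model), abbreviation_model, language_model)
print(ranking)
```

Working in log space avoids underflow when P(c) is a product of many small n-gram probabilities.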
Reformulation flow
[Figure: reformulation flow]
Offline part: clickthrough logs → clickthrough graph → Query Abbreviation Model; search query logs → query language model
Online part: Query q → candidates c1, c2, c3, … (Query Abbreviation Model) → reranking (query language model) → outputs ca, cb, cc, …
10
Label propagation on clickthrough graph
[Figure: clickthrough graph. Query nodes: "abc", "american broadcasting corporation", "alphabet song", "austrian ballet company". URL nodes: www.abc-tokyo.com, abcnews.go.com, www.alphabetsong.org, en.wikipedia.org.]
The depth of color of each line indicates the relatedness between the nodes it connects.
The depth of color of each node represents its relatedness to the seed.
11
Problems of adopting [Komachi et al. 2009]
to our query reformulation task
Preliminary experiments showed that [Komachi et al. 2009] cannot be directly applied to our task
1. Extracted not only synonymous expressions but also semantically related ones
2. Failed to alleviate semantic drift because of using normalized frequency
12
One step approximation prevents
extracting non-synonymous expressions
The one step approximation extracts queries
landing on the same URL as the seed by 1-hop label
propagation.
These queries are likely synonyms of the
seed, so the seed can be corrected to them without
changing its meaning.
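As a rough illustration of the one step approximation, the sketch below collects every query that shares at least one clicked URL with the seed. It is a simplified reading that ignores edge weights. The function name `one_hop_candidates` is hypothetical, and the query-URL pairs are invented (the node names are borrowed from the clickthrough graph figure, but the edges are assumptions).

```python
from collections import defaultdict


def one_hop_candidates(click_pairs, seed):
    """Return queries that share at least one clicked URL with the seed query."""
    urls_by_query = defaultdict(set)
    queries_by_url = defaultdict(set)
    for query, url in click_pairs:
        urls_by_query[query].add(url)
        queries_by_url[url].add(query)

    candidates = set()
    for url in urls_by_query.get(seed, ()):
        candidates |= queries_by_url[url]  # query -> URL -> query is one hop
    candidates.discard(seed)
    return candidates


# Invented clickthrough pairs; only the node names come from the figure.
click_pairs = [
    ("abc", "abcnews.go.com"),
    ("american broadcasting corporation", "abcnews.go.com"),
    ("abc", "www.alphabetsong.org"),
    ("alphabet song", "www.alphabetsong.org"),
    ("austrian ballet company", "en.wikipedia.org"),
]

candidates = one_hop_candidates(click_pairs, "abc")
print(sorted(candidates))
```

In this toy graph "austrian ballet company" is never reached, because it shares no clicked URL with the seed.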
13
Using normalized PMI [Bouma, 2009] as
countermeasure against semantic drift
\[
\mathrm{PMI}(x, p) = \ln \frac{P(x, p)}{P(x)\,P(p)} \in [-\infty, -\ln P(x, p)]
\]
PMI assigns high scores to low-frequency
events
\[
\mathrm{NPMI}(x, p) = \left\{ \ln \frac{P(x, p)}{P(x)\,P(p)} \right\} \Big/ \left\{ -\ln P(x, p) \right\} \in [-1, 1]
\]
Using NPMI naively makes the clickthrough graph dense
14
Cutting off the negative values
\[
W_{ij} =
\begin{cases}
\mathrm{NPMI}(x_i, p_j) & (\mathrm{NPMI}(x_i, p_j) > \theta) \\
0 & (\mathrm{NPMI}(x_i, p_j) \le \theta)
\end{cases}
\in [0, 1]
\]
Edges are represented as the (i,j)-th element of matrix W
• Cut off the values lower than threshold θ (θ≥0)
• The range of Wij can be normalized to [0,1]
• Prevents W from being dense
• Reduces the noise in the data
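The thresholded NPMI weights can be sketched as follows. This is an illustrative sketch, not the authors' code. It assumes that x denotes a query and p a clicked URL, and that probabilities are estimated as relative frequencies of click counts. The function name `npmi_edges` and the toy counts are hypothetical; the default θ = 0.1 is the value given on the parameters slide.

```python
import math
from collections import Counter


def npmi_edges(click_counts, theta=0.1):
    """Map (query, url) -> NPMI weight, keeping only weights above theta."""
    total = sum(click_counts.values())
    query_totals = Counter()
    url_totals = Counter()
    for (query, url), n in click_counts.items():
        query_totals[query] += n
        url_totals[url] += n

    edges = {}
    for (query, url), n in click_counts.items():
        p_xp = n / total
        if p_xp >= 1.0:
            continue  # -ln P(x, p) would be zero, so NPMI is undefined
        p_x = query_totals[query] / total
        p_p = url_totals[url] / total
        npmi = math.log(p_xp / (p_x * p_p)) / -math.log(p_xp)
        if npmi > theta:  # entries at or below theta become 0 (no edge)
            edges[(query, url)] = npmi
    return edges


# Invented click counts for illustration.
click_counts = {
    ("abc", "abcnews.go.com"): 40,
    ("american broadcasting corporation", "abcnews.go.com"): 30,
    ("abc", "en.wikipedia.org"): 10,
    ("alphabet song", "en.wikipedia.org"): 20,
}

edges = npmi_edges(click_counts, theta=0.1)
for pair, weight in sorted(edges.items()):
    print(pair, round(weight, 3))
```

With these counts the pair ("abc", "en.wikipedia.org") has a negative NPMI and is dropped, which is how the cut-off keeps W sparse.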
15
Character n-gram query language model
\[
P(c) = \prod_{i=0}^{N-1} P(x_i \mid x_{i-N+1}, \ldots, x_{i-1})
     = \prod_{i=0}^{N-1} \frac{\mathrm{freq}(x_{i-N+1}, \ldots, x_i)}{\mathrm{freq}(x_{i-N+1}, \ldots, x_{i-1})}
\]
c is a contiguous sequence of N characters: c = {x_0, x_1, …, x_{N-1}}
A language model estimated from search query logs
P(c) represents likelihood of c as a query
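A character n-gram model of this kind can be sketched with plain frequency counts. The class below is a hypothetical illustration, not the authors' implementation. It assumes maximum-likelihood estimates without smoothing, pads the start of each query with a reserved character, and counts a context only where a character follows it. The order 5 and the optional division by candidate length follow the parameters slide; the query counts are invented.

```python
import math
from collections import Counter


class CharNgramModel:
    """Unsmoothed character n-gram model estimated from query frequencies."""

    def __init__(self, order=5, pad="\u0000"):
        self.order = order
        self.pad = pad  # assumed never to occur inside a real query
        self.ngram_counts = Counter()
        self.context_counts = Counter()

    def _ngrams(self, query):
        padded = self.pad * (self.order - 1) + query
        for i in range(len(query)):
            yield padded[i:i + self.order]  # the last character is query[i]

    def train(self, query_counts):
        for query, freq in query_counts.items():
            for gram in self._ngrams(query):
                self.ngram_counts[gram] += freq
                self.context_counts[gram[:-1]] += freq

    def log_prob(self, candidate, normalize=True):
        total = 0.0
        for gram in self._ngrams(candidate):
            numerator = self.ngram_counts[gram]
            if numerator == 0:
                return float("-inf")  # unseen n-gram, no smoothing in this sketch
            total += math.log(numerator / self.context_counts[gram[:-1]])
        # Optionally divide the log-likelihood by the candidate length.
        return total / len(candidate) if normalize and candidate else total


# Invented query frequencies for illustration.
model = CharNgramModel(order=5)
model.train({"video on demand": 30, "video": 50, "vod": 20})

score = model.log_prob("video", normalize=False)
print(score)
```

Because the counts are over characters, a candidate never seen as a whole query can still receive a finite score as long as all of its character 5-grams were observed.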
17
Character n-gram is robust for
Japanese web NLP
• Hard to compute the likelihood of neologisms by word n-gram language model
• Characters themselves carry essential semantic information in Chinese and Japanese [Asahara and Matsumoto, 2004][Huang and Zhao, 2006]
• Using character 5-grams for query language model
18
Japanese abbreviation expansion
data set

Test set
• 1,916 ‘Acronym’, ‘Kanji’, and ‘Kana’ abbreviations
  • Collected from the Japanese version of Wikipedia
  • Removed single letters and duplicates
Training set
• Clickthrough logs
  • 2009/10/22 – 2009/11/9, 2010/1/1 – 2010/1/16
  • About 17,000,000 pairs of queries and URLs
  • Cut off pairs that occurred fewer than 10 times
• Web search query logs
  • 2009/8/1 – 2010/1/27
  • About 52,000,000 unique queries
  • Cut off queries that occurred fewer than 10 times
20
Judgment guideline
Table 1: Correction patterns for abbreviation expansion
1 Acronym for its English expansion
2 Acronym for its Japanese orthography
3 Japanese abbreviation for its Japanese orthography
4 Japanese abbreviation for its English orthography
Table 2: Examples of abbreviation and correction pairs
Correction pattern | Abbreviation | Correct candidates
1 | adf | Asian dub foundation
2 | ana | 全日本空輸株式会社 (All Nippon Airways)
3 | ハンスト | ハンガーストライキ (Hunger Strike)
4 | イラレ | illustrator
21
Evaluation measure
\[
\mathrm{precision} = \frac{\text{\# of correct outputs at rank } k}{\text{\# of outputs at rank } k}
\]
\[
\mathrm{coverage} = \frac{\text{\# of input queries for which the output at rank } k \text{ gives at least one correct output}}{\text{\# of all input queries}}
\]
• The agreement rate of judgment of abbreviation/expansion pairs: 47.0 %
• Cohen’s kappa measure κ = 0.63
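The two measures can be sketched as a single pass over the test queries. This is an illustrative reading of the definitions, not the authors' evaluation script. The slides do not state how queries without any candidate are counted, so this sketch assumes they contribute no outputs to the precision denominator but still count in the coverage denominator. The function name and the toy outputs are hypothetical.

```python
def precision_and_coverage_at_k(outputs, gold, k):
    """outputs: query -> ranked candidate list; gold: query -> set of correct expansions."""
    num_output = 0
    num_correct = 0
    num_covered = 0
    for query, correct in gold.items():
        top_k = outputs.get(query, [])[:k]
        hits = sum(1 for candidate in top_k if candidate in correct)
        num_output += len(top_k)
        num_correct += hits
        num_covered += 1 if hits else 0
    precision = num_correct / num_output if num_output else 0.0
    coverage = num_covered / len(gold) if gold else 0.0
    return precision, coverage


# Toy data: "dog" has no candidates at all (assumed absent from the logs).
gold = {
    "vod": {"ビデオ・オン・デマンド"},
    "ilo": {"国際労働機関"},
    "dog": {"disk original group"},
}
outputs = {
    "vod": ["ビデオオンデ", "ビデオ・オン・デマンド"],
    "ilo": ["国際労働機関", "国際労働期間"],
}

for k in (1, 2):
    print(k, precision_and_coverage_at_k(outputs, gold, k))
```

Coverage can only grow with k, while precision typically falls as lower-ranked, noisier candidates are admitted.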
22
Comparison methods
• Evaluated reranking performance of 50 candidates extracted from clickthrough logs
  • Candidates are extracted by one step approximation
• Compared three reranking methods
  1. Ranking using abbreviation model (AM) only
  2. Reranking using language model (LM) only
  3. Reranking using both AM and LM
23
Reranking with query language model improves
both precision and coverage at top-10
QAM: query abbreviation model, QLM: query language model
k  | QAM precision | QAM coverage | QLM precision | QLM coverage | QLM+QAM precision | QLM+QAM coverage
1  | 0.114 | 0.114 | 0.157 | 0.157 | 0.161 | 0.161
3  | 0.112 | 0.256 | 0.142 | 0.278 | 0.157 | 0.321
5  | 0.121 | 0.341 | 0.128 | 0.346 | 0.142 | 0.392
10 | 0.114 | 0.453 | 0.102 | 0.425 | 0.115 | 0.465
30 | 0.087 | 0.536 | 0.078 | 0.529 | 0.082 | 0.542
50 | 0.073 | 0.557 | 0.073 | 0.557 | 0.073 | 0.557
The result of using only QAM is equivalent to the method of Komachi et al.
(2009) using NPMI instead of raw frequency
24
Examples of inputs and their candidate corrections
Input | Candidates
写植 | 写真植字 (photocomposition), 写植 方, 漫画
満鉄 | 南満州鉄道株式会社 (South Manchuria Railway Corporation)
はねトび | はねるのとびら, はねるのトびら
vod | ビデオオンデ, ビデオ・オン・デマンド (Video on Demand)
ilo | 国際労働機関 (International Labour Organization), 国際労働期間
pr | パブリック・リレーションズ (public relations), prohoo!マ, プラ
Blue: Correct
Red: Incorrect
25
Error Analysis
Table 3: Types of errors
1 A partially correct query
2 A correct query but with an additional attribute word
3 A related but not abbreviated term
Besides the above reasons:
280 out of 1,916 queries did not exist in clickthrough logs
26
A partially correct query
• The likelihood of the partial query becomes higher than that of its correct spelling
  • Although the likelihood was divided by the length of the candidate’s string, the model still fails to filter out fragments of queries
vod | ビデオオンデ, ビデオオンデマンド (Video on Demand)
27
A correct query but with an
additional attribute word
• Includes combinations of correct queries and commonly used attribute words
  • e.g. “* 意味 (* meaning)”, “* とは (what does * mean?)”, etc.
  • 857 queries that co-occurred with these attribute words were classified as incorrect
写植 | 写真植字 意味, 写真植字 (photocomposition)
28
A related but not abbreviated term

A number of abbreviations coincide with
other general nouns


e.g. “dog (DOG: Disk Original Group)”
Hard to expand these abbreviations correctly
at present
29
Related Work
• Spelling correction based on edit distance
  1. Using noisy channel model with a language model created from query logs [Cucerzan and Brill, 2004]
  2. Reranking method applying neural net to the spelling correction candidates obtained from Cucerzan’s method [Gao et al. 2010][Sun et al. 2010]
• Synonym extraction
  1. Using similarity based on JS divergence of commonly clicked URL distribution between queries [Wei et al. 2009]
• Query expansion
  1. Proposed a unified approach using CRFs with extended feature function [Guo et al. 2008]
31
6. Conclusion
• Proposed a query expansion method using web search logs
• In experiments, a combination of label propagation and language model outperformed methods using either label propagation or language model alone
• In the future, will address this task using discriminative learning as a ranking problem
33
ANY QUESTIONS?
34
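PageRank on a query graph
[Figure: query graph. Nodes: 国際労働機関 とは (what is the International Labour Organization), 国際労働機関 意味 (International Labour Organization meaning), 国際労働機関 (International Labour Organization), 国際労働機関 役割 (International Labour Organization role).]
• Partial queries do not co-occur with attribute words frequently
• Edges represent common co-occurring words between queries
• Will assign higher scores to correct queries than a QLM and QAM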
35
Parameters
• Construction of a clickthrough graph
  • The threshold θ of elements Wij was set to 0.1
  • The parameter α for label propagation was set to 0.0001
• Construction of a language model
  • Character 5-gram
  • Likelihood was divided by the length of the candidate’s string
36
Correct candidates types
Table: correct candidate types
1 Named entity
2 Common expression
3 Japanese meaning of the common expression
37
Cohen’s kappa
       | U2 Yes | U2 No
U1 Yes | 56     | 47
U1 No  | 16     | 3376
Kappa = 0.63
38
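The reported kappa can be checked from the four judgment counts of the two annotators (U1 and U2). The sketch below uses the standard two-rater, two-category definition of Cohen's kappa; the function name `cohens_kappa` is hypothetical.

```python
def cohens_kappa(yes_yes, yes_no, no_yes, no_no):
    """Cohen's kappa for two raters and two categories.

    Arguments are the cells of the 2x2 table: (U1 Yes, U2 Yes), (U1 Yes, U2 No),
    (U1 No, U2 Yes), (U1 No, U2 No).
    """
    total = yes_yes + yes_no + no_yes + no_no
    observed = (yes_yes + no_no) / total        # raw agreement
    u1_yes = (yes_yes + yes_no) / total         # marginal rate of "Yes" for U1
    u2_yes = (yes_yes + no_yes) / total         # marginal rate of "Yes" for U2
    expected = u1_yes * u2_yes + (1 - u1_yes) * (1 - u2_yes)  # chance agreement
    return (observed - expected) / (1 - expected)


kappa = cohens_kappa(56, 47, 16, 3376)
print(round(kappa, 2))  # 0.63
```

Raw agreement is very high here because most pairs are judged "No" by both annotators, which is why the chance-corrected kappa is the more informative figure.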
[Komachi et al. 2009]


Suggested that normalized frequency causes
semantic drift
Suggested using relative frequency as
countermeasure against semantic drift
39
P-values of Wilcoxon’s signed rank test
Comparison | P-value
QAM and QAM+QLM | 0.055
QLM and QAM+QLM | 7.79e-10
Compared the harmonic mean of precision and coverage for each model, with rank k from 1 to 50
40
Query abbreviation model
• Uses the label propagation method on a clickthrough graph (based on [Komachi et al. 2009])
• The probability of the label propagation can be regarded as the conditional probability P(q|c)
  • The label propagation is mathematically identical to the random walk with restart [Tong and Faloutsos, KDD 06]
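To give a feel for the random walk with restart view, the sketch below runs a plain power iteration on a small query-URL graph. It is a generic illustration under stated assumptions and does not reproduce the formulation of [Komachi et al. 2009] or its parameter α. The function names, the restart probability 0.15, and the edge weights are all hypothetical.

```python
def build_graph(weighted_clicks):
    """Turn (query, url) -> weight into a symmetric adjacency dictionary."""
    graph = {}
    for (query, url), weight in weighted_clicks.items():
        graph.setdefault(query, {})[url] = weight
        graph.setdefault(url, {})[query] = weight
    return graph


def random_walk_with_restart(graph, seed, restart=0.15, iterations=100):
    """Return node -> stationary score of a walk that restarts at the seed."""
    out_weight = {node: sum(graph[node].values()) for node in graph}
    scores = {node: 1.0 if node == seed else 0.0 for node in graph}
    for _ in range(iterations):
        updated = {node: (restart if node == seed else 0.0) for node in graph}
        for node, neighbors in graph.items():
            if out_weight[node] == 0:
                continue
            for neighbor, weight in neighbors.items():
                # Move a share of the node's score along each weighted edge.
                updated[neighbor] += (1 - restart) * scores[node] * weight / out_weight[node]
        scores = updated
    return scores


# Invented edge weights; node names are borrowed from the clickthrough graph figure.
weighted_clicks = {
    ("abc", "abcnews.go.com"): 0.5,
    ("american broadcasting corporation", "abcnews.go.com"): 0.8,
    ("abc", "www.alphabetsong.org"): 0.2,
    ("alphabet song", "www.alphabetsong.org"): 0.9,
    ("austrian ballet company", "en.wikipedia.org"): 0.7,
}

scores = random_walk_with_restart(build_graph(weighted_clicks), seed="abc")
for node, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{value:.3f}  {node}")
```

Nodes that cannot be reached from the seed keep a score of zero, and queries connected to the seed through strongly weighted URLs rank higher.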
41