Japanese Abbreviation Expansion with Query and Clickthrough Logs

Kei Uchiumi†, Mamoru Komachi‡, Keigo Machinaga, Toshiyuki Maezawa†, Toshinori Satou†, Yoshinori Kobayashi†
†: Yahoo Japan Corporation
‡: Nara Institute of Science and Technology

Query expansion improves recall for search engines
• “cod” → “Call of Duty”

Once: using a hand-made dictionary
• Lexicographers detected pairs of queries and expansions

Recently: hard to compile manually
• Time-consuming to construct a dictionary
• Requires domain knowledge
• The web grows rapidly
• Even harder to maintain an up-to-date dictionary

Our purpose: generating an abbreviation dictionary from web search logs
• Clickthrough logs
  Learning semantic categories [Komachi et al. 2009]
  Named entity extraction [Jain et al. 2010]
• Search query logs
  Query alteration [Hagiwara et al. 2009]
  Acquiring semantic categories [Sekine et al. 2007]
• An excellent resource for many NLP applications in the web domain

The main contribution
1. A novel re-ranking method that combines web query logs and clickthrough logs
2. The first attempt to automatically recognize the full spelling of a given Japanese abbreviation
This method is used as an assistant tool for building dictionaries at Yahoo! Japan

Agenda
1. Introduction
2. Query reformulation based on noisy channel model
   1. Query abbreviation model
   2. Query language model
3. Evaluation
4. Related work
5. Conclusion
Noisy channel model for query reformulation
\[
c^{*} = \operatorname*{argmax}_{c} P(c \mid q)
      = \operatorname*{argmax}_{c} \frac{P(c)\,P(q \mid c)}{P(q)}
      = \operatorname*{argmax}_{c} P(c)\,P(q \mid c)
\]
q: query, c: correct query
• P(q | c): query abbreviation model
• P(c): query language model

Reformulation flow
[Figure: reformulation flow. Offline part: a clickthrough graph is built from clickthrough logs, and a query language model is built from search query logs. Online part: for a query q, the query abbreviation model produces candidates c1, c2, c3, …, which are reranked to give the outputs ca, cb, cc, ….]

Label propagation on clickthrough graph
[Figure: clickthrough graph linking the queries "abc", "american broadcasting corporation", "alphabet song", and "austrian ballet company" with the URLs www.abc-tokyo.com, abcnews.go.com, www.alphabetsong.org, and en.wikipedia.org.]
The color depth of an edge indicates the relatedness between the two nodes it connects. The color depth of a node indicates its relatedness to the seed.

Problems in adapting [Komachi et al. 2009] to our query reformulation task
Preliminary experiments showed that [Komachi et al. 2009] cannot be directly applied to our task:
1. It extracted not only synonymous expressions but also semantically related expressions
2. It failed to alleviate semantic drift because it uses normalized frequency

One-step approximation prevents extracting non-synonymous expressions
The one-step approximation extracts the queries that land on the same URL as the seed, using 1-hop label propagation. These queries are likely synonyms of the seed, so the seed can be corrected to them without changing its meaning.
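The one-step approximation above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `(query, url, count)` log format, the function name `one_step_candidates`, the click counts, and the count-product score are all assumptions made for the example (the deck itself weights edges differently).

```python
from collections import defaultdict


def one_step_candidates(click_pairs, seed):
    """Return queries that share at least one clicked URL with `seed`.

    click_pairs: iterable of (query, url, count) tuples (hypothetical format).
    The score is the sum over shared URLs of seed_count * query_count, which
    is only an illustrative stand-in for the edge weights used in the deck.
    """
    url_to_queries = defaultdict(dict)  # url -> {query: clicks}
    seed_urls = {}                      # url -> clicks from the seed
    for query, url, count in click_pairs:
        url_to_queries[url][query] = url_to_queries[url].get(query, 0) + count
        if query == seed:
            seed_urls[url] = seed_urls.get(url, 0) + count

    scores = defaultdict(float)
    for url, seed_count in seed_urls.items():
        # One hop: seed -> url -> every other query that clicked this url.
        for query, count in url_to_queries[url].items():
            if query != seed:
                scores[query] += seed_count * count
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data: node names follow the figure, click counts are invented.
clicks = [
    ("abc", "abcnews.go.com", 30),
    ("american broadcasting corporation", "abcnews.go.com", 12),
    ("abc", "www.alphabetsong.org", 5),
    ("alphabet song", "www.alphabetsong.org", 20),
    ("austrian ballet company", "en.wikipedia.org", 3),
]
candidates = one_step_candidates(clicks, "abc")
print(candidates)
```

In this toy data, "austrian ballet company" shares no URL with the seed and is therefore never proposed, which is the point of restricting propagation to a single hop.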
Using normalized PMI [Bouma, 2009] as a countermeasure against semantic drift
\[
\mathrm{PMI}(x, p) = \ln \frac{P(x, p)}{P(x)\,P(p)} \in [-\infty,\ -\ln P(x, p)]
\]
• PMI assigns high scores to low-frequency events
\[
\mathrm{NPMI}(x, p) = \frac{\ln \dfrac{P(x, p)}{P(x)\,P(p)}}{-\ln P(x, p)} \in [-1,\ 1]
\]
• Using NPMI naively makes the clickthrough graph dense

Cutting off the negative values
\[
W_{ij} =
\begin{cases}
\mathrm{NPMI}(x_i, p_j) & (\mathrm{NPMI}(x_i, p_j) > \theta) \\
0 & (\mathrm{NPMI}(x_i, p_j) \le \theta)
\end{cases}
\in [0,\ 1]
\]
Edges are represented as the (i, j)-th elements of the matrix W
• Cut off the values lower than the threshold θ (θ ≥ 0)
• The range of W_ij can be normalized to [0, 1]
• Prevents W from being dense
• Reduces the noise in the data

Character n-gram query language model
\[
P(c) = \prod_{i=0}^{n-1} P(x_i \mid x_{i-N+1}, \ldots, x_{i-1})
     = \prod_{i=0}^{n-1} \frac{\mathrm{freq}(x_{i-N+1}, \ldots, x_i)}{\mathrm{freq}(x_{i-N+1}, \ldots, x_{i-1})}
\]
• c = {x_0, x_1, …, x_{n-1}} is the candidate as a character sequence; an N-gram is a contiguous sequence of N characters
• The language model is estimated from search query logs
• P(c) represents the likelihood of c as a query

Character n-gram is robust for Japanese web NLP
• It is hard to compute the likelihood of neologisms with a word n-gram language model
• Characters themselves carry essential semantic information in Chinese and Japanese [Asahara and Matsumoto, 2004][Huang and Zhao, 2006]
• We use character 5-grams for the query language model
Japanese abbreviation expansion data set
Test set
• 1,916 abbreviations of the types "Acronym", "Kanji", and "Kana"
• Collected from the Japanese version of Wikipedia
• Single letters and duplicates were removed
Training set
• Clickthrough logs
  2009/10/22 – 2009/11/9, 2010/1/1 – 2010/1/16
  About 17,000,000 pairs of queries and URLs
  Pairs that occurred fewer than 10 times were cut off
• Web search query logs
  2009/8/1 – 2010/1/27
  About 52,000,000 unique queries
  Queries that occurred fewer than 10 times were cut off

Judgment guideline
Table 1: Correction patterns for abbreviation expansion
1  Acronym for its English expansion
2  Acronym for its Japanese orthography
3  Japanese abbreviation for its Japanese orthography
4  Japanese abbreviation for its English orthography

Table 2: Examples of abbreviation and correction pairs
Correction pattern  Abbreviation  Correct candidate
1                   adf           Asian dub foundation
2                   ana           全日本空輸株式会社 (All Nippon Airways)
3                   ハンスト      ハンガーストライキ (Hunger Strike)
4                   イラレ        illustrator

Evaluation measure
\[
\text{precision} = \frac{\text{\# of correct outputs at rank } k}{\text{\# of outputs at rank } k}
\]
\[
\text{coverage} = \frac{\text{\# of queries whose output at rank } k \text{ gives at least one correct output}}{\text{\# of all input queries}}
\]
• The agreement rate of judgment of abbreviation/expansion pairs: 47.0%
• Cohen's kappa measure κ = 0.63

Comparison methods
• Evaluated the reranking performance on 50 candidates extracted from clickthrough logs
• Candidates are extracted by the one-step approximation
• Compared three reranking methods:
1. Ranking using the query abbreviation model (QAM) only
2. Reranking using the query language model (QLM) only
3. Reranking using both QAM and QLM

Reranking with the query language model improves both precision and coverage at top-10

k    QAM                   QLM                   QLM+QAM
     precision  coverage   precision  coverage   precision  coverage
1    0.114      0.114      0.157      0.157      0.161      0.161
3    0.112      0.256      0.142      0.278      0.157      0.321
5    0.121      0.341      0.128      0.346      0.142      0.392
10   0.114      0.453      0.102      0.425      0.115      0.465
30   0.087      0.536      0.078      0.529      0.082      0.542
50   0.073      0.557      0.073      0.557      0.073      0.557

The result of using only QAM is equivalent to the method of Komachi et al. (2009) with NPMI instead of raw frequency

Examples of inputs and their candidate corrections
Input: Candidates
写植: 写真植字 (photocomposition), 写植 方, 漫画
満鉄: 南満州鉄道株式会社 (South Manchuria Railway Corporation)
はねトび: はねるのとびら, はねるのトびら
vod: ビデオオンデ, ビデオ・オン・デマンド (Video on Demand)
ilo: 国際労働機関 (International Labour Organization), 国際労働期間
pr: パブリック・リレーションズ (public relations), prohoo!マ, プラ
Blue: Correct, Red: Incorrect

Error analysis
Table 3: Types of errors
1  A partially correct query
2  A correct query but with an additional attribute word
3  A related but not abbreviated term
Besides the above reasons, 280 of the 1,916 queries did not exist in the clickthrough logs

A partially correct query
• The likelihood of the partial query becomes higher than that of its correct spelling
• Although the likelihood was divided by the length of the candidate's string, this still fails to filter out fragments of queries
vod: ビデオオンデ, ビデオオンデマンド (Video on Demand)

A correct query but with an additional attribute word
• Candidates include combinations of correct queries and commonly used attribute words, e.g. “* 意味 (* meaning)”, “* とは (what does * mean?)”, etc.
• 857 queries that co-occurred with these attribute words were classified as incorrect
写植: 写真植字 意味, 写真植字 (photocomposition)

A related but not abbreviated term
• A number of abbreviations coincide with other general nouns, e.g.
“dog (DOG: Disk Original Group)”
• It is hard to expand these abbreviations correctly at present

Related work
Spelling correction based on edit distance
1. Using a noisy channel model with a language model created from query logs [Cucerzan and Brill, 2004]
2. A reranking method applying a neural net to the spelling correction candidates obtained from Cucerzan's method [Gao et al. 2010][Sun et al. 2010]
Synonym extraction
1. Using similarity based on the JS divergence of the distributions of commonly clicked URLs between queries [Wei et al. 2009]
Query expansion
1. A unified approach using CRFs with an extended feature function [Guo et al. 2008]

Conclusion
• We have proposed a query expansion method using web search logs
• In experiments, we found that a combination of label propagation and a language model outperformed methods using either label propagation or a language model alone
• In the future, we will address this task as a ranking problem using discriminative learning

Any questions?
PageRank on a query graph
[Figure: query graph with the nodes 国際労働機関 とは (what is 国際労働機関), 国際労働機関 意味 (meaning), 国際労働機関 (International Labour Organization), and 国際労働機関 役割 (role).]
• Partial queries do not co-occur with attribute words frequently
• Edges represent common co-occurring words between queries
• Expected to assign higher scores to correct queries than QLM and QAM do

Parameters
Construction of a clickthrough graph
• The threshold θ of the elements W_ij was set to 0.1
• The parameter α for label propagation was set to 0.0001
Construction of a language model
• Character 5-gram
• The likelihood was divided by the length of the candidate's string

Correct candidate types
Table: Correct candidate types
1  Named entity
2  Common expression
3  Japanese meaning of the common expression

Cohen's kappa
          U2 Yes   U2 No
U1 Yes    56       47
U1 No     16       3376
Kappa = 0.63

[Komachi et al. 2009]
• Suggested that normalized frequency causes semantic drift
• Suggested using relative frequency as a countermeasure against semantic drift

P-values of Wilcoxon's signed rank test
Compared models     P-value
QAM and QAM+QLM     0.055
QLM and QAM+QLM     7.79e-10
The test compares the harmonic mean of precision and coverage of each model, with the rank k ranging from 1 to 50

Query abbreviation model
• Uses the label propagation method on a clickthrough graph (based on [Komachi et al. 2009])
• The probability given by label propagation can be regarded as the conditional probability P(q|c)
• Label propagation is mathematically identical to the random walk with restart [Tong and Faloutsos KDD 06]
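As a quick check, the kappa value on the Cohen's kappa slide can be recomputed from its contingency table. The sketch below assumes that rows are annotator U1 and columns are annotator U2, as the extracted table suggests; the function name `cohen_kappa` is introduced here for illustration only.

```python
def cohen_kappa(yes_yes, yes_no, no_yes, no_no):
    """Cohen's kappa for two annotators and a binary label.

    Arguments are the four cells of the 2x2 table, assuming rows are
    annotator U1 and columns are annotator U2 (an assumption about the slide).
    """
    total = yes_yes + yes_no + no_yes + no_no
    observed = (yes_yes + no_no) / total   # observed agreement
    u1_yes = (yes_yes + yes_no) / total    # U1's rate of "Yes"
    u2_yes = (yes_yes + no_yes) / total    # U2's rate of "Yes"
    # Agreement expected by chance from the two marginal distributions.
    expected = u1_yes * u2_yes + (1 - u1_yes) * (1 - u2_yes)
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa(56, 47, 16, 3376)
# Share of pairs both annotators accepted among pairs at least one accepted.
positive_agreement = 56 / (56 + 47 + 16)

print(round(kappa, 2))               # 0.63
print(round(positive_agreement, 3))  # 0.471
```

The second figure is close to the 47.0% agreement rate reported on the evaluation measure slide, which suggests that rate was computed over pairs judged correct by at least one annotator; that reading is an inference from the numbers, not something the slides state.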