Cross-Language Query Classification
using Web Search for Exogenous
Knowledge
Xuerui Wang, Andrei Broder,
Evgeniy Gabrilovich, Vanja
Josifovski, Bo Pang
(University of Massachusetts,
Yahoo! Research)
WSDM ’09
Motivation
• Non-English Web is growing rapidly
• But language processing tools and
resources for non-English languages are
not keeping pace
• Text Classification
• Taxonomy : Open Directory Project
Idea
• Which class does the query “麥子杰” belong to?
• 1. Directly Translate:
麥子杰
=> Chinese to English
=> “Wheat Hero”
=> English Text Classifier
=> “Food”
• 2. Using English Search Results
麥子杰
=> Chinese to English
=> “Wheat Hero”
=> Search Engine
=> English Pages
=> English Text Classifier
=> …
• 3. Using Chinese Search Results
麥子杰
=> Search Engine
=> Chinese Pages
=> Chinese to English
=> English Pages
=> English Text Classifier
=> “Singer”
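The three pipelines above differ only in where translation happens relative to search. A minimal sketch of that ordering follows. The callables `translate`, `search_en`, `search_zh`, and `classify` are hypothetical placeholders supplied by the caller; they are not taken from the slides or from any real library.

```python
def method_1_direct(query, translate, classify):
    """Method 1: translate the query, then classify the translation."""
    return classify(translate(query))


def method_2_english_results(query, translate, search_en, classify):
    """Method 2: translate the query, search in English, classify each page."""
    pages = search_en(translate(query))
    return [classify(page) for page in pages]


def method_3_chinese_results(query, translate, search_zh, classify):
    """Method 3: search in Chinese first, translate the pages, classify each."""
    pages = search_zh(query)
    return [classify(translate(page)) for page in pages]
```

In methods 2 and 3 the per-page labels still need to be merged into one answer for the query, for example by voting.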
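Why is the 3rd method better?
• Although MT systems translate a short query
better than a document (quality: short query >
document), the many words in a document can
outvote the translation errors
• Dictatorship > democracy!? (one translated
query vs. many voting words)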
• Assume:
- the translated short query is correct with
probability 50%
- each unigram in the document is correct with
probability α = 20%
- document length is N
- there are K = 9 classes
- uniform noise
• Expected Votes for the Correct Class:
α * N = 0.2 * N
• Expected Votes for Each Incorrect Class:
(1 - α) * N / (K – 1)
= 0.8 * N / 8 = 0.1 * N
• The Chance of Incorrectness:
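One rough way to estimate this chance under the assumptions listed above is Monte Carlo simulation. The sketch below is not the authors' derivation. It assumes that each of the N unigrams casts one independent vote, that a wrong vote goes to one of the K - 1 incorrect classes uniformly at random, and that a tie between the correct class and an incorrect one counts as an error. The function name `error_rate` and its parameters are illustrative.

```python
import random


def error_rate(n_words, alpha=0.2, k=9, trials=2000, seed=0):
    """Estimate how often the correct class fails to win the vote."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        votes = [0] * k  # index 0 is the correct class
        for _ in range(n_words):
            if rng.random() < alpha:
                votes[0] += 1  # correctly translated unigram
            else:
                votes[rng.randrange(1, k)] += 1  # uniform noise
        # Assumption: a tie with any incorrect class counts as an error.
        if votes[0] <= max(votes[1:]):
            errors += 1
    return errors / trials


if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, error_rate(n))
```

The estimate can be compared against the 50% error assumed for the directly translated short query to see how it changes as the document length N grows.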
Experiment
• 1. Crawl top 32 search results
• 2. Translate these pages into English using
– Google Translate
– Babelfish (Yahoo Translate)
– Dictionary-based Translation (CEDICT)
• 3. Classify each page with an existing classifier
trained on English data
– centroid-based classifier
– up to 5 ranked labels for each page
• 4. Vote among the predicted classes
– weighted equally
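Step 4 can be sketched as a simple vote count. This is an illustration under one reading of "weighted equally": every label predicted for every page contributes exactly one vote, regardless of its rank on the page or the page's position in the search results. The function name `vote` and the example labels are hypothetical.

```python
from collections import Counter


def vote(page_labels, max_labels=5):
    """Rank classes by how many pages predicted them (one vote per label)."""
    counts = Counter()
    for labels in page_labels:
        # Each page contributes at most `max_labels` ranked labels.
        for label in labels[:max_labels]:
            counts[label] += 1
    return [label for label, _ in counts.most_common()]


if __name__ == "__main__":
    predictions = [["Music", "Arts"], ["Music"], ["Food", "Music"]]
    print(vote(predictions))
```

A rank-weighted or position-weighted vote would be a natural variation, but the slides only mention equal weights.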
Evaluation
• divide the query log into deciles by
query frequency on a log scale
• randomly sample the same number of
queries from each of the 10 deciles
– C200, C1000, R100
• each query-class pair was judged by two
Chinese speakers
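The frequency-stratified sampling described above might look like the following sketch. It assumes that "deciles in log scale" means ten equal-width bins over log(frequency); the slides do not spell this out, so treat the binning rule, the function name `sample_by_log_frequency`, and its parameters as assumptions.

```python
import math
import random


def sample_by_log_frequency(query_freq, per_bin, n_bins=10, seed=0):
    """Sample up to `per_bin` queries from each log-frequency bin."""
    rng = random.Random(seed)
    logs = {q: math.log(f) for q, f in query_freq.items()}
    lo, hi = min(logs.values()), max(logs.values())
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all equal
    bins = [[] for _ in range(n_bins)]
    for q, lf in logs.items():
        idx = min(int((lf - lo) / width), n_bins - 1)
        bins[idx].append(q)
    sample = []
    for members in bins:
        members.sort()  # fixed order so the seed gives repeatable output
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample
```

Sampling the same number of queries per bin keeps rare queries from being swamped by the few very frequent ones.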
English results + Chinese results
• Macro-combination
• Micro-combination