Cross-Language Query Classification using Web Search for Exogenous Knowledge
Xuerui Wang, Andrei Broder, Evgeniy Gabrilovich, Vanja Josifovski, Bo Pang
(University of Massachusetts, Yahoo! Research)
WSDM ’09

Motivation
• The non-English Web is growing rapidly
• But language processing tools and resources for it are not
• Text classification
• Taxonomy: Open Directory Project

Idea
• Which category does “麥子杰” belong to?
• 1. Translate directly: 麥子杰 => Chinese to English => “Wheat Hero” => English Text Classifier => “Food”
• 2. Use English search results: 麥子杰 => Chinese to English => “Wheat Hero” => Search Engine => English Pages => English Text Classifier => …
• 3. Use Chinese search results: 麥子杰 => Search Engine => Chinese Pages => Chinese to English => English Pages => English Text Classifier => “Singer”

Why is the 3rd method better?
• Even though MT translation quality ranks short query > document
• Dictatorship > democracy!?
• Assume:
  – the short query is correct with probability 50%
  – each unigram in the document is correct with probability α = 20%
  – the document length is N
  – there are K = 9 classes
  – noise is uniform over the incorrect classes
• Expected votes for the correct class: α * N = 0.2 * N
• Expected votes for each incorrect class: (1 - α) * N / (K - 1) = 0.1 * N
• The chance of incorrectness:

Experiment
• 1. Crawl the top 32 search results
• 2. Translate these pages into English with
  – Google Translate
  – Babelfish (Yahoo Translate)
  – Dictionary-based translation (CEDICT)
• 3. Classify with an existing classifier trained on English data
  – centroid-based classifier
  – up to 5 ranked labels for each page
• 4. Vote among the predicted classes
  – weighted equally

Evaluation
• Divide the query log into ten deciles by query frequency (log scale)
• Randomly sample the same number of queries from each decile
  – C200, C1000, R100
• Each query-class pair was judged by two Chinese speakers

English results + Chinese results
• Macro-combination
• Micro-combination
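The voting argument from the "Why 3rd method is better?" slide and the equal-weight vote from step 4 of the experiment can be sketched in a few lines of Python. This is only an illustrative sketch under the slide's stated assumptions (α = 20%, K = 9, uniform noise), not the authors' implementation. The function names `vote` and `simulate_error_rate`, the trial count, the seed, and the choice to count ties as errors are all assumptions made for the example.

```python
import random
from collections import Counter


def vote(page_labels):
    """Equal-weight vote: every predicted label on every page counts once."""
    counts = Counter(label for labels in page_labels for label in labels)
    return counts.most_common(1)[0][0]


def simulate_error_rate(n, alpha=0.2, k=9, trials=2000, seed=0):
    """Estimate how often the correct class fails to win the vote outright.

    Class 0 is the correct class. Each of the n unigrams votes for it with
    probability alpha; otherwise it votes uniformly for one of the k - 1
    incorrect classes. Ties are counted as errors (an assumption).
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        counts = [0] * k
        for _ in range(n):
            if rng.random() < alpha:
                counts[0] += 1
            else:
                # Uniform noise: pick one of the incorrect classes 1..k-1.
                counts[rng.randrange(1, k)] += 1
        if counts[0] <= max(counts[1:]):
            errors += 1
    return errors / trials


if __name__ == "__main__":
    # Hypothetical per-page label lists (up to 5 ranked labels per page).
    pages = [["Singer", "Music"], ["Singer"], ["Food", "Singer"]]
    print(vote(pages))  # prints Singer
    # Estimated chance of an incorrect vote for a few document lengths N.
    for n in (10, 50, 200):
        print(n, simulate_error_rate(n))
```

Under these assumptions the estimated error should shrink as N grows, which is the intuition behind preferring many noisy per-word votes over a single 50%-accurate short-query translation.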