Zenglin Xu¹, Rong Jin², Kaizhu Huang¹, Michael R. Lyu¹, and Irwin King¹

¹ Department of Computer Science and Engineering, The Chinese University of Hong Kong, {zlxu, kzhuang, lyu, king}@cse.cuhk.edu.hk
² Department of Computer Science and Engineering, Michigan State University, rongjin@cse.msu.edu

• Given only a small number of labeled documents, it is very challenging to build a reliable classifier.
• Unlabeled data are helpful in automated text categorization.

How to obtain unlabeled documents?
• Unlabeled documents can be collected through search engines.
• Semi-supervised learning can take advantage of both the labeled and the unlabeled documents.

• A general framework for semi-supervised text categorization that collects unlabeled documents via Web search engines.
• A novel discriminative query generation method.
• The categorization framework can significantly improve classification accuracy.

• Query generation: generates the textual queries for document retrieval.
• Document retrieval: retrieves Web documents through the Web search engine.
• Semi-supervised text categorization: uses both the labeled documents and the retrieved unlabeled Web documents.

1. Query generation: generate a query for every labeled document
(document: (x, y); V_i: vocabulary of the i-th document; w: word weights; ξ: margin error)

2. Text Categorization Models
• D: labeled documents; U: retrieved unlabeled documents
• Auxiliary SVM (y* is the input)
• Semi-supervised SVM (y* is an optimization variable)

• Data repositories: 20-newsgroup, Reuters-21578, Ohsumed
• Training data: 5 labeled documents in each category
• Each document generates one query
• Each query returns 100 unlabeled documents
• Auxi-SVM: Auxiliary SVM (optimization: QP)
• Semi-SVM: Semi-supervised SVM (optimization: CCCP)
• Search engine: Google
• Accuracy improvement over SVM: Auxi-SVM 26%, Semi-SVM 34%

CIKM 2008, Napa Valley, California, October 26-30, 2008
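
The query-generation and auxiliary-SVM steps named on this poster can be illustrated with a small, self-contained sketch. It is not the authors' formulation: the word weights w come from a plain subgradient hinge-loss trainer rather than the QP on the poster, the query is simply the k highest-weighted words in a document's vocabulary V_i, and `search` is a hypothetical stand-in for a Web search engine that ranks a tiny local corpus by term overlap. All function names, the toy corpus, and the parameters `k`, `n`, `lam`, and `lr` are assumptions made for illustration. The semi-supervised SVM (where y* is an optimization variable solved by CCCP) is not covered here.

```python
from collections import Counter

def bow(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def score(w, x):
    """Linear decision value w . x for a bag-of-words document x."""
    return sum(w.get(t, 0.0) * c for t, c in x.items())

def train_linear_svm(docs, labels, epochs=50, lam=0.01, lr=0.1):
    """Subgradient descent on the L2-regularized hinge loss; labels are +1/-1."""
    w = {}
    for _ in range(epochs):
        for x, y in zip(docs, labels):
            margin = y * score(w, x)
            for t in w:                      # regularization: shrink all weights
                w[t] *= 1.0 - lr * lam
            if margin < 1.0:                 # hinge loss is active: margin error
                for t, c in x.items():
                    w[t] = w.get(t, 0.0) + lr * y * c
    return w

def generate_query(x, y, w, k=2):
    """Pick the k words in x's vocabulary V_i that most support label y."""
    return sorted(x, key=lambda t: (-y * w.get(t, 0.0), t))[:k]

def search(query, corpus, n=100):
    """Hypothetical search engine: rank corpus documents by query-term overlap."""
    scored = [(sum(bow(d)[t] for t in query), d) for d in corpus]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [d for s, d in ranked if s > 0][:n]

def build_auxiliary_set(labeled, w, corpus, k=2, n=100):
    """Each retrieved document inherits y*, the label of the query's source."""
    docs, labels, queries = [], [], []
    for text, y in labeled:
        x = bow(text)
        query = generate_query(x, y, w, k)
        queries.append(query)
        docs.append(x)
        labels.append(y)
        for hit in search(query, corpus, n):
            docs.append(bow(hit))
            labels.append(y)                 # y* is taken as given (an input)
    return docs, labels, queries

# Toy data (invented for illustration only).
labeled = [("the team won the hockey game", 1),
           ("the court ruled on the tax law", -1)]
corpus = ["hockey playoffs start tonight with a big game",
          "new tax law passes the senate",
          "goalie saves the game in overtime",
          "court hears appeal on tax ruling"]

# Baseline: SVM trained on the labeled documents D only.
w_base = train_linear_svm([bow(t) for t, _ in labeled], [y for _, y in labeled])

# Auxiliary model: SVM trained on D plus retrieved documents U with labels y*.
aux_docs, aux_labels, queries = build_auxiliary_set(labeled, w_base, corpus)
w_aux = train_linear_svm(aux_docs, aux_labels)

test_doc = bow("goalie playoffs overtime")
print(queries)
print(score(w_base, test_doc), score(w_aux, test_doc))
```

In this toy run the test document shares no words with the two labeled documents, so the baseline has no evidence about it, while the auxiliary model picks up vocabulary from the retrieved documents. That is the intuition behind collecting unlabeled documents through search, shown here on invented data rather than the reported benchmarks.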