Semi-supervised text categorization

Zenglin Xu1, Rong Jin2, Kaizhu Huang1, Michael R. Lyu1, and Irwin King1
1 Department of Computer Science and Engineering
The Chinese University of Hong Kong
{zlxu, kzhuang, lyu, king}@cse.cuhk.edu.hk
2 Department of Computer Science and Engineering
Michigan State University
[email protected]
1
2
• Given a small number of labeled documents, it is very challenging
to build a reliable classifier
• Unlabeled data are helpful in automated text categorization
How to obtain unlabeled documents?
• We can collect the unlabeled documents through search engines
• Semi-supervised learning can take advantage of both the labeled
documents and unlabeled documents
• A general framework for semi-supervised text
categorization that collects the unlabeled
documents via Web search engines.
• A novel discriminative query generation
method
• The categorization framework can
significantly improve the classification
accuracy.
3
• Query generation that generates the textual
queries for document retrieval
• Document retrieval that retrieves the Web
documents through the Web search engine
• Semi-supervised text categorization utilizing both
the labeled documents and the retrieved unlabeled
Web documents
1. Query generation: generate a query for every labeled
document (document: (x,y), Vi: vocabulary for i-th document,
w: word weights, ξ: margin error)
2. Text Categorization Models
• D: labeled documents, U: retrieved unlabeled documents
• Auxiliary SVM (y* is the input)
• Semi-supervised SVM (y* is an optimization variable)
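The query generation step can be illustrated with a minimal sketch. It is not the margin-based formulation of the poster (whose objective over w and ξ is not reproduced here). As a simple stand-in, each word is scored by the difference between its document frequency inside and outside the document's category, and the query for a labeled document (x, y) is made of the top-scoring words from its own vocabulary Vi. The function names, the parameter k, and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def word_scores(docs, labels, target):
    """Stand-in for discriminative word weights: document frequency of a
    word inside the target category minus its document frequency outside."""
    pos = [set(d.lower().split()) for d, y in zip(docs, labels) if y == target]
    neg = [set(d.lower().split()) for d, y in zip(docs, labels) if y != target]
    pos_df = Counter(w for d in pos for w in d)
    neg_df = Counter(w for d in neg for w in d)
    vocab = set(pos_df) | set(neg_df)
    return {
        w: pos_df[w] / max(len(pos), 1) - neg_df[w] / max(len(neg), 1)
        for w in vocab
    }


def generate_query(doc, label, docs, labels, k=3):
    """Return the k highest-scoring words drawn only from the document's
    own vocabulary (Vi), ties broken alphabetically."""
    scores = word_scores(docs, labels, label)
    vocab_i = set(doc.lower().split())
    ranked = sorted(vocab_i, key=lambda w: (-scores.get(w, 0.0), w))
    return ranked[:k]


# Toy labeled set (hypothetical): one query per labeled document.
docs = [
    "the team won the hockey game",
    "hockey players skate on ice",
    "the new graphics card renders images",
    "the driver improves graphics performance",
]
labels = ["sport", "sport", "tech", "tech"]

for doc, label in zip(docs, labels):
    print(label, generate_query(doc, label, docs, labels, k=2))
```

Restricting candidates to the document's own vocabulary keeps each query specific to that document, while the category-level scores push common words such as "the" to the bottom of the ranking.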
4
• Data Repositories: 20-newsgroup, Reuters-21578, Ohsumed
• Training data: 5 labeled documents in each category
• Each document generates one query
• Each query returns 100 unlabeled documents
• Auxi-SVM: Auxiliary SVM (Optimization: QP)
• Semi-SVM: Semi-supervised SVM (Optimization: CCCP)
• Search engine: Google
• Accuracy improvement over SVM:
Auxi-SVM: 26%
Semi-SVM: 34%
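The setup above fixes the size of the retrieved pool: 5 labeled documents per category, one query per document, and 100 results per query. The short sketch below works out the resulting upper bound. The function name and the choice of 20 categories in the usage line are assumptions for illustration, and the real pool may be smaller if a query returns fewer or duplicate documents.

```python
def max_unlabeled_pool(num_categories, labeled_per_category=5,
                       queries_per_document=1, results_per_query=100):
    """Upper bound on retrieved unlabeled documents for a given setup."""
    queries = num_categories * labeled_per_category * queries_per_document
    return queries * results_per_query


# Per category: 5 queries x 100 results.
print(max_unlabeled_pool(1))
# Hypothetical run over 20 categories.
print(max_unlabeled_pool(20))
```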
CIKM 2008, Napa Valley, California, October 26-30, 2008