Text Classification Using Stochastic Keyword Generation
Cong Li, Ji-Rong Wen and Hang Li
Microsoft Research Asia
August 22nd, 2003

Outline
  - Introduction
  - Text Classification Using Stochastic Keyword Generation
  - Experimental Results
  - Conclusion and Future Work

Introduction
  - Supervised text classification
      - Question: how can additional data be used in training to improve performance?
  - New text classification problem
      - Summaries of texts are available in training, and they are more indicative of the contents
      - Note: summaries are not available at classification time
      - Example: classification at a help desk

Example
  Email:
    When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem.
  Summary:
    receive emails; some emails have no subject and message body
  Categories:
    Empty Outlook Message
    Cannot Open Word File

New Text Classification Problem
  - Spaces
      - Users’ emails: space $X$
      - Categories: space $Y$
      - Engineers’ summaries (for training): space $S$
  - Assumption
      - Summaries are much easier to classify than the original texts

                   Conventional                              New
  Classification   $X \to Y$                                 $X \to Y$
  Training Data    $\{(x_1, y_1), \ldots, (x_l, y_l)\}$      $\{(x_1, y_1), \ldots, (x_l, y_l)\}$ and $\{(x_1, s_1), \ldots, (x_l, s_l)\}$
  Test Data        $\{(x_1, y_1), \ldots, (x_r, y_r)\}$      $\{(x_1, y_1), \ldots, (x_r, y_r)\}$

Text Classification Using SKG
  Conventional text classification:
    email $x \in X$ -> classification -> category $y \in Y$
  Text classification using SKG:
    email $x \in X$ -> SKG -> probability vector $\theta(x)$ -> classification -> category $y \in Y$
  Example (the same email under both approaches):
    email: When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem.
    probability vector $\theta(x)$: (email 0.75, receive 0.68, subject 0.45, body 0.45, …)
    category: Empty Outlook Message

Stochastic Keyword Generation
  - Generating keywords from a given text
  - Stochastic Keyword Generation (SKG)
      - Generate keywords and their conditional probabilities of occurrence given the text
  - Example
      email: When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem.
      -> Stochastic Keyword Generation ->
      emails 0.75, receive 0.68, subject 0.45, body 0.45

SKG Model
  Texts and associated summaries:
    $\{(x_1, s_1 = (s_1^{(1)}, \ldots, s_m^{(1)})), \ldots, (x_l, s_l = (s_1^{(l)}, \ldots, s_m^{(l)}))\}$
  SKG model:
    $(\theta_1(X) = P(S_1 = 1 \mid X), \ldots, \theta_m(X) = P(S_m = 1 \mid X))$
  New text $x$ -> probability vector:
    $\theta(x) = (\theta_1(x), \theta_2(x), \ldots, \theta_m(x)) \in \Theta$

Model for Each Keyword
  Training data for each keyword $S_j$:
    $\{(x_1, S_j = 0), (x_2, S_j = 1), \ldots, (x_l, S_j = 0)\}$
  Naive Bayesian model, with feature selection by information gain (IG):
    $\theta_j(X) = P(S_j = 1 \mid X) = \frac{P(S_j = 1, X)}{P(S_j = 0, X) + P(S_j = 1, X)}$
  New text $x$ -> conditional probability:
    $\theta_j(x) = P(S_j = 1 \mid x)$

Learning Using SKG
  Classified texts:
    $\{(x_1, y_1), \ldots, (x_l, y_l)\}$
  Applying SKG gives the training data:
    $\{(\theta(x_1), y_1), \ldots, (\theta(x_l), y_l)\}$
  Classification:
    $h(\theta(x)) = \arg\max_y \left[ \sum_{i=1}^{m} w_{i,y}^{(S)} \theta_i(x) + b_y \right]$

Data in Experiments
  - Data from the help desk of Microsoft
      - 2517 texts from 52 categories
      - About 10000 unique words in texts
      - About 1500 unique words in summaries
      - Stopword removal was conducted, but not stemming
  - Training/test split
      - 5-fold cross validation

Experimental Settings
  - Classifiers
      - Linear SVM (Platt 1998; Dumais et al. 1998)
      - Perceptron algorithm with margins (PAM) (Li et al. 2002)
  - Methods
      - Text classification using SKG
      - Methods for comparison:
          - Prior
          - Texts for training
          - Summaries for training
          - (text+summary)s for training
          - Deterministic keyword generation (DKG)

Experimental Results
  Method            Top 1 Accuracy (%)   Top 3 Accuracy (%)
  Prior             34.1                 47.8
  Text (PAM)        58.7                 73.7
  Sum (PAM)         50.0                 69.1
  Text+Sum (PAM)    56.2                 70.4
  SKG (PAM)         63.6                 78.9
  Text (SVM)        57.3                 76.7
  Sum (SVM)         56.7                 71.2
  Text+Sum (SVM)    47.4                 73.7
  SKG (SVM)         61.5                 81.5

SKG versus DKG
  Method            Top 1 Accuracy (%)   Top 3 Accuracy (%)
  DKG (PAM)         59.8                 73.9
  SKG (PAM)         63.6                 78.9
  DKG (SVM)         57.6                 76.9
  SKG (SVM)         61.5                 81.5

Discussion
  email $x \in X$ -> SKG -> probability vector $\theta(x)$ -> classification -> category $y \in Y$
    email: When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem.
    probability vector $\theta(x)$: (email 0.75, receive 0.68, subject 0.45, body 0.45, …)
    category: Empty Outlook Message
  summary $s \in S$:
    receive emails; some emails have no subject and message body

Conclusion and Future Work
  - Conclusion
      - Text classification using SKG significantly outperforms the methods that do not use it
  - Future Work
      - Theoretical analysis of the problem and the proposed method
      - Application in different settings

Thank You
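
Appendix sketch (not from the original slides)

The SKG pipeline described in the slides (one naive Bayesian model per summary keyword, a probability vector θ(x) for each text, then a linear classifier over θ(x)) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the class and function names (`KeywordModel`, `SKG`, `train_perceptron`, `classify`) and the toy emails are hypothetical, Laplace smoothing is assumed, feature selection with IG is omitted, and a plain multiclass perceptron stands in for PAM or the linear SVM.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer (assumption; the slides do not specify one).
    return text.lower().split()


class KeywordModel:
    """Naive Bayes estimate of P(S_j = 1 | x) for a single summary keyword."""

    def __init__(self, docs, labels, vocab):
        self.vocab = vocab
        self.log_prior, self.log_like = {}, {}
        for c in (0, 1):
            members = [d for d, s in zip(docs, labels) if s == c]
            # Laplace-smoothed class prior and word likelihoods (assumed smoothing).
            self.log_prior[c] = math.log((len(members) + 1) / (len(docs) + 2))
            counts = Counter(w for d in members for w in d)
            total = sum(counts.values()) + len(vocab)
            self.log_like[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}

    def prob(self, tokens):
        # log P(S_j = c, x); words outside the training vocabulary are skipped.
        joint = {
            c: self.log_prior[c]
            + sum(self.log_like[c][w] for w in tokens if w in self.vocab)
            for c in (0, 1)
        }
        # P(S_j=1 | x) = P(S_j=1, x) / (P(S_j=0, x) + P(S_j=1, x)), computed stably.
        top = max(joint.values())
        e0, e1 = math.exp(joint[0] - top), math.exp(joint[1] - top)
        return e1 / (e0 + e1)


class SKG:
    """One KeywordModel per summary keyword; theta(x) is the probability vector."""

    def __init__(self, texts, summaries):
        docs = [tokenize(t) for t in texts]
        sums = [set(tokenize(s)) for s in summaries]
        vocab = {w for d in docs for w in d}
        self.keywords = sorted({w for s in sums for w in s})
        self.models = [
            KeywordModel(docs, [int(k in s) for s in sums], vocab)
            for k in self.keywords
        ]

    def theta(self, text):
        tokens = tokenize(text)
        return [m.prob(tokens) for m in self.models]


def classify(v, w, b):
    # h(theta(x)) = argmax_y [ sum_i w[y][i] * theta_i(x) + b[y] ]
    return max(w, key=lambda y: sum(wi * vi for wi, vi in zip(w[y], v)) + b[y])


def train_perceptron(vectors, labels, epochs=20):
    # Plain multiclass perceptron, a stand-in for PAM / linear SVM.
    classes = sorted(set(labels))
    w = {y: [0.0] * len(vectors[0]) for y in classes}
    b = {y: 0.0 for y in classes}
    for _ in range(epochs):
        for v, y in zip(vectors, labels):
            pred = classify(v, w, b)
            if pred != y:
                for i, vi in enumerate(v):
                    w[y][i] += vi
                    w[pred][i] -= vi
                b[y] += 1.0
                b[pred] -= 1.0
    return w, b


# Hypothetical toy data, loosely modeled on the help-desk example.
texts = [
    "outlook message is blank when i open email",
    "cannot open word file after install",
    "email arrives but message body is empty",
    "word document will not open",
]
summaries = ["email empty body", "word file open", "email empty body", "word file open"]
labels = ["Empty Outlook Message", "Cannot Open Word File"] * 2

skg = SKG(texts, summaries)                      # uses (x_i, s_i) pairs
vectors = [skg.theta(t) for t in texts]          # (theta(x_i), y_i) training data
w, b = train_perceptron(vectors, labels)
print(classify(skg.theta("my email message is blank"), w, b))
```

Note that summaries are used only to fit the keyword models; at classification time the new text alone is mapped to θ(x), which matches the constraint that summaries are not available for test data.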