Text Classification Using Stochastic Keyword Generation

Cong Li, Ji-Rong Wen and Hang Li
Microsoft Research Asia
August 22nd, 2003
Outline
Introduction
Text Classification Using Stochastic Keyword Generation
Experimental Results
Conclusion and Future Work
Introduction

Supervised Text Classification


Question: How can additional data be used in training to improve performance?
New Text Classification Problem



Summaries of the texts are available in training, and they are more indicative of the contents
Note: Summaries are not available at classification time
Example: classification at a help desk
Example
Email: "When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem."
Categories: Empty Outlook Message, Cannot Open Word File
Summary: "receive emails; some emails have no subject and message body"
New Text Classification Problem

Spaces




Users’ emails: space X
Categories: space Y
Engineers’ summaries (for training): space S
Assumption
Summaries are much easier to classify
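Conventional Classification
Task: $X \to Y$
Training Data: $\{(x_1, y_1), \ldots, (x_l, y_l)\}$
Test Data: $\{(x_1, y_1), \ldots, (x_r, y_r)\}$
New Classification Problem
Task: $X \to Y$
Training Data: $\{(x_1, y_1), \ldots, (x_l, y_l)\}$ and $\{(x_1, s_1), \ldots, (x_l, s_l)\}$
Test Data: $\{(x_1, y_1), \ldots, (x_r, y_r)\}$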
Text Classification Using SKG
Conventional Text Classification
email $x \in X$ → classification → category $y \in Y$
Text Classification Using SKG
email $x \in X$ → SKG → probability vector $\theta(x) \in \Theta$ → classification → category $y \in Y$
Example
email: "When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem."
probability vector: (email 0.75, receive 0.68, subject 0.45, body 0.45, …)
category: Empty Outlook Message
Stochastic Keyword Generation


Generating Keywords from a Given Text
Stochastic Keyword Generation (SKG)


Generate keywords and their conditional probabilities of occurrence given the text
Example
Text: "When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem."
Keywords produced by Stochastic Keyword Generation, with their probabilities:
emails    0.75
receive   0.68
subject   0.45
body      0.45
SKG Model
Training data: texts and associated summaries
$$\{(x_1, s_1 = (s_1^{(1)}, \ldots, s_m^{(1)})), \ldots, (x_l, s_l = (s_1^{(l)}, \ldots, s_m^{(l)}))\}$$
SKG model:
$$(\theta_1(X) = P(S_1 = 1 \mid X), \ldots, \theta_m(X) = P(S_m = 1 \mid X))$$
For a new text $x$, the model outputs the probability vector
$$\theta(x) = (\theta_1(x), \theta_2(x), \ldots, \theta_m(x)) \in \Theta$$
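Each summary vector $s_i$ can be read as a list of binary indicators, one per keyword, marking which keywords occur in the $i$-th summary. The sketch below builds such vectors in Python. It assumes the keyword set is simply every distinct word in the training summaries and that tokenization is lowercase whitespace splitting; the slides specify neither, and the function names and toy summaries are hypothetical.

```python
def build_keyword_vocabulary(summaries):
    """Collect the distinct summary words, sorted for a stable keyword order."""
    return sorted({word for summary in summaries for word in summary.lower().split()})


def summary_to_indicator(summary, vocabulary):
    """Return (s_1, ..., s_m): s_j is 1 if keyword j occurs in the summary, else 0."""
    words = set(summary.lower().split())
    return [1 if keyword in words else 0 for keyword in vocabulary]


# Toy summaries (illustrative only, not taken from the help desk data).
summaries = ["receive emails no subject", "cannot open word file"]
vocab = build_keyword_vocabulary(summaries)
vectors = [summary_to_indicator(s, vocab) for s in summaries]
```

Column $j$ of `vectors` then supplies the 0/1 labels used to train the model for keyword $j$.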
Model for Each Keyword
Training data for each keyword $S_j$:
$$\{(x_1, S_j = 0), (x_2, S_j = 1), \ldots, (x_l, S_j = 0)\}$$
Naive Bayesian model with feature selection by information gain (IG):
$$\theta_j(X) = P(S_j = 1 \mid X) = \frac{P(S_j = 1, X)}{P(S_j = 0, X) + P(S_j = 1, X)}$$
For a new text $x$, the model outputs the conditional probability
$$\theta_j(x) = P(S_j = 1 \mid x)$$
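The per-keyword model can be illustrated with a small naive Bayes classifier that normalizes the two joint probabilities, as in the formula for $\theta_j$. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a multinomial event model with Laplace smoothing, it omits the information-gain feature selection step, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def train_keyword_model(texts, labels):
    """Fit a naive Bayes model for one keyword S_j.

    texts  : list of token lists
    labels : list of 0/1 values (1 if keyword j occurs in the text's summary)
    """
    vocab = {w for tokens in texts for w in tokens}
    model = {"vocab": vocab, "log_prior": {}, "log_like": {}}
    for c in (0, 1):
        docs = [tokens for tokens, s in zip(texts, labels) if s == c]
        # Laplace-smoothed class prior, so an empty class never yields log(0).
        model["log_prior"][c] = math.log((len(docs) + 1) / (len(texts) + 2))
        counts = Counter(w for tokens in docs for w in tokens)
        total = sum(counts.values()) + len(vocab)
        model["log_like"][c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return model


def keyword_probability(model, tokens):
    """Return theta_j(x) = P(S_j=1, x) / (P(S_j=0, x) + P(S_j=1, x))."""
    log_joint = {}
    for c in (0, 1):
        log_joint[c] = model["log_prior"][c] + sum(
            model["log_like"][c][w] for w in tokens if w in model["vocab"]
        )
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_joint.values())
    p0 = math.exp(log_joint[0] - shift)
    p1 = math.exp(log_joint[1] - shift)
    return p1 / (p0 + p1)


# Toy data (illustrative only): label 1 means the keyword occurs in the summary.
texts = [["message", "blank"], ["file", "open"], ["message", "empty"]]
labels = [1, 0, 1]
model = train_keyword_model(texts, labels)
p_high = keyword_probability(model, ["message", "blank"])
p_low = keyword_probability(model, ["file", "open"])
```

Training one such model per keyword and evaluating all of them on a new text gives the components of the probability vector.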
Learning Using SKG
Classified texts: $\{(x_1, y_1), \ldots, (x_l, y_l)\}$
Applying SKG to each text yields the training data $\{(\theta(x_1), y_1), \ldots, (\theta(x_l), y_l)\}$
Classification:
$$h(\theta(x)) = \arg\max_{y} \left[ \sum_{i=1}^{m} w_{i,y}^{(S)} \theta_i(x) + b_y \right]$$
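The decision rule is a linear score per category followed by an argmax. A minimal Python sketch follows. The probability values are the ones shown in the example slide, while the weights and biases are arbitrary numbers chosen for illustration; in practice they would be learned by a linear classifier such as SVM or PAM.

```python
def classify(theta, weights, biases):
    """Return the category y maximizing sum_i weights[y][i] * theta[i] + biases[y]."""
    scores = {
        y: sum(w_i * t_i for w_i, t_i in zip(weights[y], theta)) + biases[y]
        for y in weights
    }
    return max(scores, key=scores.get)


# theta(x) for the example email: (email, receive, subject, body).
theta = [0.75, 0.68, 0.45, 0.45]

# Hypothetical weights and biases, not learned from any data.
weights = {
    "Empty Outlook Message": [1.0, 1.0, 0.5, 0.5],
    "Cannot Open Word File": [-0.5, -0.5, 0.0, 0.0],
}
biases = {"Empty Outlook Message": 0.0, "Cannot Open Word File": 0.2}

predicted = classify(theta, weights, biases)
```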
Data in Experiments

Data of the Help Desk of Microsoft





2517 texts from 52 categories
About 10000 unique words in texts
About 1500 unique words in summaries
Conducted stopword removal, but not stemming
Training/Test Split

5-fold cross validation
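For readers unfamiliar with the protocol, 5-fold cross validation partitions the texts into five folds and uses each fold once as the test set while training on the other four. The sketch below is one simple way to produce such splits. Shuffling with a fixed seed and the absence of stratification by category are assumptions, since the slides do not say how the folds were formed.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 2517 is the number of texts reported for the help desk data.
splits = list(k_fold_indices(2517, k=5))
```

Accuracy would then be averaged over the five test folds.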
Experimental Settings

Classifiers



Linear SVM (Platt 1998; Dumais et al. 1998)
Perceptron algorithm with margins (PAM) (Li et al. 2002)
Methods


Text classification using SKG
Methods for comparison:





Prior
Texts for training
Summaries for training
Texts plus summaries for training
Deterministic keyword generation (DKG)
Experimental Results
Method            Top 1 Accuracy (%)    Top 3 Accuracy (%)
Prior             34.1                  47.8
Text (PAM)        58.7                  73.7
Sum (PAM)         50.0                  69.1
Text+Sum (PAM)    56.2                  70.4
SKG (PAM)         63.6                  78.9
Text (SVM)        57.3                  76.7
Sum (SVM)         56.7                  71.2
Text+Sum (SVM)    47.4                  73.7
SKG (SVM)         61.5                  81.5
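The two accuracy columns can be computed from per-category scores. The sketch below assumes the usual reading of top-k accuracy, namely that a text counts as correct when its true category is among the k highest-scoring categories; the slides do not define the metric, and the scores shown are toy values.

```python
def top_k_accuracy(scores_per_text, true_labels, k):
    """Percentage of texts whose true category is among the k best-scored ones."""
    hits = 0
    for scores, truth in zip(scores_per_text, true_labels):
        ranked = sorted(scores, key=scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(true_labels)


# Toy scores for two texts over four hypothetical categories.
scores_per_text = [
    {"A": 0.90, "B": 0.05, "C": 0.03, "D": 0.02},
    {"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10},
]
true_labels = ["A", "C"]

top1 = top_k_accuracy(scores_per_text, true_labels, k=1)
top3 = top_k_accuracy(scores_per_text, true_labels, k=3)
```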
SKG versus DKG
Method            Top 1 Accuracy (%)    Top 3 Accuracy (%)
DKG (PAM)         59.8                  73.9
SKG (PAM)         63.6                  78.9
DKG (SVM)         57.6                  76.9
SKG (SVM)         61.5                  81.5
Discussion
email $x \in X$: "When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem."
SKG → probability vector $\theta(x) \in \Theta$: (email 0.75, receive 0.68, subject 0.45, body 0.45, …)
classification → category $y \in Y$: Empty Outlook Message
summary $s \in S$: "receive emails; some emails have no subject and message body"
Conclusion and Future Work

Conclusion


Text classification using SKG significantly outperforms the methods that do not use it
Future Work


Theoretical analysis of the problem and the proposed method
Application in different settings
Thank You