Support Vector Machine Classification

10-Fold Classifier Accuracies

Classifiers: Naïve Bayes (multinomial), SVM (radial basis kernel), Language model, SVM (linear kernel).
Feature selection methods (number of features): None, Chi-2 (5613*), KLD (5000), KLD (10000).
Accuracies reported across these classifier/feature-selection combinations: 0.93500140, 0.98562993, 0.88252699, 0.95426271, 0.90905059, 0.88021420, 0.96144794, 0.90783690; the remaining combinations were not evaluated (N/A).
* 5613 is the size of the set containing the top 300 features from each class.

We are looking for the following accuracies: CNB 95.75%, WCNB 95.75%, TWCNB 94.25%.
Support Vector Machine Classification
We used LIBSVM to build an SVM classifier for the newsgroup data. Since this is a multiclass
problem, we used the one-against-all approach described in [1]. To eliminate noise in the dataset,
we preprocessed the messages with the Chi-squared feature selection method described in the
2nd deliverable.
To train the SVM, we first converted each message to a normalized tf-idf vector. Then, for each
newsgroup i, we constructed a training file train.i and a testing file test.i from the tf-idf vectors.
The train.i file contains 9/10 of the messages from newsgroup i and test.i contains the remaining
1/10. All messages from newsgroup i are labeled +1 (positive examples). A matching number of
negative examples (i.e. messages from newsgroups other than i) is included in train.i and test.i,
with an equal number of examples drawn from each newsgroup other than i.
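As an illustration of this preparation step, the following sketch (not the code we used; the helper names and the in-memory representation of a message as a sparse {feature_id: weight} dict are assumptions) builds train.i and test.i in LIBSVM's sparse input format from normalized tf-idf vectors:

import math
import random

def l2_normalize(vec):
    # L2-normalize a sparse tf-idf vector given as {feature_id: weight}.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()} if norm > 0 else vec

def to_libsvm_line(label, vec):
    # Format one example in LIBSVM's sparse format: "<label> <index>:<value> ...";
    # LIBSVM expects the feature indices in ascending order, hence the sort.
    feats = " ".join("%d:%.6f" % (f, w) for f, w in sorted(vec.items()))
    return "%+d %s" % (label, feats)

def write_one_vs_all_split(group, vectors_by_group, train_path, test_path, seed=0):
    # Build train.i / test.i for newsgroup `group`: 9/10 of its messages become
    # positive training examples, 1/10 positive test examples, and a matching
    # number of negatives is drawn evenly from the other newsgroups.
    rng = random.Random(seed)
    pos = [l2_normalize(v) for v in vectors_by_group[group]]
    rng.shuffle(pos)
    cut = (9 * len(pos)) // 10
    pos_train, pos_test = pos[:cut], pos[cut:]

    others = [g for g in vectors_by_group if g != group]
    per_group = max(1, len(pos) // len(others))  # equal share from each other newsgroup
    neg = []
    for g in others:
        sample = rng.sample(vectors_by_group[g], min(per_group, len(vectors_by_group[g])))
        neg.extend(l2_normalize(v) for v in sample)
    rng.shuffle(neg)
    neg_train, neg_test = neg[:cut], neg[cut:cut + len(pos_test)]

    with open(train_path, "w") as f:
        for v in pos_train:
            f.write(to_libsvm_line(+1, v) + "\n")
        for v in neg_train:
            f.write(to_libsvm_line(-1, v) + "\n")
    with open(test_path, "w") as f:
        for v in pos_test:
            f.write(to_libsvm_line(+1, v) + "\n")
        for v in neg_test:
            f.write(to_libsvm_line(-1, v) + "\n")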
We experimented with three kernel types: linear, polynomial, and radial basis (i.e. the "Gaussian"
kernel). We found that the polynomial kernel (with default parameters) performed poorly.
However, this was probably due to our inability to explore the polynomial kernel's parameter
space systematically (i.e. there is no grid.py-like tool for the polynomial kernel).
The parameter-less linear kernel performed quite well during testing but was outperformed by a
radial basis kernel with tuned parameters. To determine the radial basis parameters we used the
grid.py tool provided with LIBSVM: treating the radial basis parameters as the variables and
5-fold cross-validation accuracy as the objective function, it searches a grid of parameter values
for the combination that maximizes classifier accuracy. Applying this process to training data
from the 20 newsgroups data set showed that gamma=0.5 and C=8 are the best radial basis
parameters for this particular classification problem.
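For example, with the LIBSVM Python bindings (the module is svmutil or libsvm.svmutil depending on the LIBSVM version), training and testing one one-against-all model with the tuned radial basis parameters looks roughly like this:

# Requires the LIBSVM Python bindings; depending on the LIBSVM version the
# module is `svmutil` or `libsvm.svmutil`.
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

# gamma=0.5 and C=8 are the tuned values reported above; they came out of
# LIBSVM's tools/grid.py (e.g. "python grid.py -v 5 train.1"), which runs a
# 5-fold cross-validation grid search over log2(C) and log2(gamma).
TUNED_PARAMS = "-t 2 -c 8 -g 0.5"  # -t 2 selects the radial basis (RBF) kernel

y_train, x_train = svm_read_problem("train.1")  # files laid out as described above
model = svm_train(y_train, x_train, TUNED_PARAMS)

y_test, x_test = svm_read_problem("test.1")
labels, (accuracy, mse, scc), values = svm_predict(y_test, x_test, model)
print("one-against-all accuracy for newsgroup 1: %.2f%%" % accuracy)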
With knowledge of which kernel parameters to use, we implemented an SVM classifier for the
20 newsgroups data set. During training, it converts each message to a normalized tf-idf vector
and then builds 20 one-against-all models in memory. It is possible to save the training, testing
and model data to the hard drive by changing the SAVE_FOR_OFFLINE_USE member variable
to true.
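The write-up does not spell out how the 20 one-against-all models are combined at classification time; a common choice, sketched below under that assumption, is to assign a message to the newsgroup whose model produces the largest decision value (tfidf_vec is the same sparse {feature_id: weight} representation used above):

from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

NUM_GROUPS = 20

# Train one one-against-all model per newsgroup and keep them in memory,
# mirroring the in-memory models described above.
models = []
for i in range(1, NUM_GROUPS + 1):
    y, x = svm_read_problem("train.%d" % i)
    models.append(svm_train(y, x, "-t 2 -c 8 -g 0.5"))

def classify(tfidf_vec):
    # Score the message against every one-against-all model and return the
    # newsgroup whose model yields the largest decision value.
    best_group, best_score = None, float("-inf")
    for i, model in enumerate(models, start=1):
        # svm_predict returns (predicted labels, accuracy stats, decision values);
        # the dummy label 0 is only used for LIBSVM's accuracy printout.
        _, _, vals = svm_predict([0], [tfidf_vec], model)
        if vals[0][0] > best_score:
            best_group, best_score = i, vals[0][0]
    return best_group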
See the "10-fold classifier accuracies" table for a summary of the SVM classifier's performance
on the 20 newsgroups data set.
Feature Selection with KL Divergence
The results shown in the "10-fold classifier accuracies" table are telling. Despite the author's
claim to the contrary, the KL divergence-based feature selection method (KLD) described in [2]
does not perform as well as the approach based on mutual information (Chi-2). One reason for
this is that Chi-2 guarantees that the set of retained features K contains the top N features from
every class (i.e. each class is equally represented in the set K). This is possible because the
feature score (i.e. mutual information) is a function of both the feature and the class, so the best
features within each class can be identified.
The KLD score, on the other hand, is a function of the feature only. This means it is not possible
to select the top N features within a particular class. One risk in using KLD is therefore that a
particular class may be over- or underrepresented in the set of top features. It could be the case
that one class contains a larger proportion of high-scoring KLD words, in which case the set of
top features will contain a disproportionate number of words from that class. Should this
happen, the weaker signals present in the other classes will be inadvertently excluded from the
training data.
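To make the contrast concrete, here is a minimal sketch (the score dictionaries are hypothetical stand-ins for the actual Chi-2 and KLD scores): a per-(feature, class) score supports taking the top N features from every class, while a single per-feature score only supports one global ranking.

def select_top_n_per_class(scores_by_class, n=300):
    # Per-class selection, possible with a Chi-2 / mutual-information style
    # score: scores_by_class[c][f] is the score of feature f for class c.
    # Taking the top n features from every class guarantees that each class
    # is represented in the retained set K (the union of the per-class lists
    # can be smaller than 20 * n, e.g. the 5613 features in the table).
    retained = set()
    for scores in scores_by_class.values():
        retained.update(sorted(scores, key=scores.get, reverse=True)[:n])
    return retained

def select_top_k_global(kld_scores, k=5000):
    # Global selection, which is all a KLD-style score allows: kld_scores[f]
    # is a single score per feature, so the retained set cannot be balanced
    # across classes, and a class with many high-scoring words can crowd out
    # the weaker signals of the other classes.
    return set(sorted(kld_scores, key=kld_scores.get, reverse=True)[:k])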
References
[1] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines.
IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
[2] K. Schneider. A new feature selection score for multinomial naive Bayes text classification
based on KL-divergence. In The Companion Volume to the Proceedings of the 42nd Annual
Meeting of the Association for Computational Linguistics, Barcelona, Spain, July 2004.