10-Fold Classifier Accuracies

Classifier: Naïve Bayes (multinomial), SVM (radial basis kernel), Language model, SVM (linear kernel)
Feature Selection Method (number of features): None, Chi-2 (5613*), KLD (5000), KLD (10000)
10-fold accuracies: 0.93500140, N/A, 0.98562993, 0.88252699, 0.95426271, N/A, 0.90905059, 0.88021420, 0.96144794, N/A, 0.90783690, N/A

* 5613 is the size of the set containing the top 300 features from each class.

For reference, we are looking for the following accuracies: CNB: 95.75%, WCNB: 95.75%, TWCNB: 94.25%.

Support Vector Machine Classification

We used LIBSVM to build an SVM classifier for the newsgroup data. Since this is a multiclass problem, we used the one-against-all approach described in [1]. To eliminate noise in the dataset, we preprocessed the messages with the Chi-squared feature selection method described in the 2nd deliverable.

To train the SVM, we first converted each message to a normalized tf-idf vector. Then, for each newsgroup i, we constructed a training file train.i and a testing file test.i from the tf-idf vectors. The train.i file contains 9/10 of the messages from newsgroup i and test.i contains the remaining 1/10. All messages from newsgroup i are labeled with +1 (as positive training examples). A matching number of negative examples (i.e. messages from newsgroups other than i) was included in train.i and test.i, with an equal number of examples drawn from each newsgroup other than i.

We experimented with three kernel types: linear, polynomial, and radial basis (i.e. the "Gaussian" kernel). The polynomial kernel (with default parameters) performed poorly, but this was probably due to our inability to systematically explore its parameter space (there is no analogue of the grid.py tool for the polynomial kernel). The parameter-less linear kernel performed quite well during testing but was outperformed by a radial basis kernel with tuned parameters.

To determine the radial basis parameters we used the grid.py tool provided with LIBSVM. It treats the radial basis parameters (C and gamma) as variables and 5-fold cross-validation accuracy as the objective function, and searches for the combination that maximizes classifier accuracy. Applying this process to training data from the 20 newsgroups data set revealed that gamma=0.5 and C=8 are the best radial basis parameters for this particular classification problem.

With knowledge of which kernel parameters to use, we implemented an SVM classifier for the 20 newsgroups data set. During training, it converts each message to a normalized tf-idf vector and then builds 20 one-against-all models in memory. The training, testing and model data can be saved to the hard drive by setting the SAVE_FOR_OFFLINE_USE member variable to true. See the "10-fold classifier accuracies" table for a summary of the SVM classifier's performance on the 20 newsgroups data set.

Feature Selection with KL Divergence

The results shown in the "10-fold classifier accuracies" table are telling. Despite the author's claim to the contrary, the KL divergence-based feature selection method (KLD) described in [2] does not perform as well as the approach based on mutual information (Chi-2). One reason for this is that Chi-2 guarantees that the set of retained features K contains the N top features from every class, i.e. each class is equally represented in K. This is possible because the feature score (i.e. mutual information) is a function of both feature and class, which allows the best features within each class to be identified.
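To make this contrast concrete, here is a minimal sketch of the two selection styles. It is not the code we submitted; it uses scikit-learn's chi2 scorer purely for illustration and assumes a non-negative document-term matrix X with newsgroup labels y, with function names of our own choosing.

import numpy as np
from sklearn.feature_selection import chi2  # chi-squared score of each feature against a label vector

def chi2_top_n_per_class(X, y, n=300):
    """Union of the top-n chi-squared features for each class (one-vs-rest).

    Because the score is computed per (feature, class) pair, every class
    contributes its own n best features to the retained set K.
    """
    retained = set()
    for c in np.unique(y):
        scores, _ = chi2(X, (y == c).astype(int))  # score each feature against class c only
        scores = np.nan_to_num(scores)             # features with no occurrences score 0
        retained.update(np.argsort(scores)[::-1][:n])
    return sorted(retained)  # e.g. 20 classes x 300 features gave a set of 5613 in our experiments

def global_top_k(scores, k=5000):
    """Top-k cut under a single, class-independent score (e.g. the KLD score).

    Nothing constrains how many of the k features come from any one class,
    so a class rich in high-scoring words can crowd the others out.
    """
    scores = np.nan_to_num(np.asarray(scores, dtype=float))
    return sorted(np.argsort(scores)[::-1][:k])

The KLD (5000) and KLD (10000) runs in the table use the second, global style of cut, which is where the risk discussed next arises.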
The KLD score, on the other hand, is a function only of the feature itself. This means it is not possible to select the top N features within a particular class, so one risk in using KLD is that a particular class may be over- or underrepresented in the set of top features. It could be the case that one class contains a larger proportion of high-scoring KLD words, in which case the set of top features will contain a disproportionate number of words from that class. Should this happen, the weaker signals present in the other classes will be inadvertently excluded from the training data.

References

[1] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.

[2] K. Schneider. A new feature selection score for multinomial naive Bayes text classification based on KL-divergence. In The Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, July 2004.