CS276: Programming Assignment 2
Richard Frankel, Tim Harrington
frankel@cs.stanford.edu, taharrin@stanford.edu

Naïve Bayes Classification

Discuss accuracy of Multivariate Naïve Bayes and Multinomial Naïve Bayes.

While the multivariate (Bernoulli) model and the multinomial model are both valid models to use with a Naïve Bayes classifier, they differ in ways that affect classification accuracy: the multinomial model keeps track of how many times each word occurs in a document, whereas the multivariate model records only whether or not each word appears.

Chi-squared Feature Selection

Effects of and a discussion on chi-squared feature selection.

K-fold Cross Validation

Comment on the changes in accuracy due to and implications of applying k-fold cross-validation.

Transformed Weight-normalized Complement Naïve Bayes

Compare the Naïve Bayes classifier with the Transformed Weight-normalized Complement Naïve Bayes (TWCNB) classifier.

Domain Specific Techniques

Results from applying domain-specific techniques/experimenting with different techniques.

Support Vector Machine Classification

We used LIBSVM to build an SVM classifier for the newsgroup data. Since this is a multiclass problem, we used the one-against-all approach described in [1]. We experimented with three kernel types: linear, polynomial, and radial basis (i.e., the "Gaussian" kernel).

We found that the polynomial kernel (with default parameters) performed poorly. There may be better parameter choices, but we were not able to explore the polynomial kernel's parameter space systematically to find them (there is no grid.py-like tool for the polynomial kernel). The linear kernel, which has no kernel parameters to tune, performed quite well during testing but was outperformed by a radial basis kernel with tuned parameters.

To determine the radial basis parameters we used the grid.py tool provided with LIBSVM. It treats the radial basis parameters (gamma and C) as variables and 5-fold cross-validation accuracy as the objective function, searching a grid of parameter settings for the combination that maximizes classifier accuracy. Applying this process to training data from the 20 newsgroups data set revealed that gamma=0.5 and C=8 are the best radial basis parameters for this particular classification problem.

Knowing which kernel and parameters to use, we implemented an SVM classifier for the 20 newsgroups data set. During training, it converts each message to a normalized tf-idf vector and then builds 20 one-against-all models in memory. The training, testing, and model data can be saved to disk by setting the SAVE_FOR_OFFLINE_USE member variable to true. See the classifier accuracy comparison table below for a summary of the SVM classifier's 10-fold cross-validation performance on the 20 newsgroups data set.

Other Classifiers

Language Model Classifier

LingPipe implements a sophisticated language model classifier that constructs a (process) language model per category and a multivariate distribution over the categories. Scores are computed as adjusted sample cross-entropy rates, which allow between-document comparisons. We built a wrapper class called LanguageModelClassifier that uses LingPipe's language model classifier to train on and classify newsgroup messages. Since word position affects language model construction, we did not combine this classifier with any feature selection methods. Running 10-fold cross-validation with this classifier yielded impressive results: at 98.6%, it has the highest accuracy of all the classifiers we tested.
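To make the scoring mechanism concrete, the sketch below is a deliberately simplified, self-contained illustration written by us; it is not LingPipe code and does not use LingPipe's API. It trains one add-one-smoothed character-bigram model per category and assigns a document to the category whose model gives the lowest cross-entropy rate. LingPipe's process language models are higher-order and more carefully smoothed, but the classification rule is the same in spirit.

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Toy illustration of per-category language-model classification:
     * one character-bigram model per category, documents scored by
     * their average cross-entropy under each model.
     */
    public class ToyLmClassifier {

        // bigramCounts.get(category).get("ab") = count of the character pair "ab" in that category
        private final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
        private final Map<String, Integer> totalBigrams = new HashMap<>();

        /** Accumulate character-bigram counts for one training document. */
        public void train(String category, String text) {
            Map<String, Integer> counts =
                    bigramCounts.computeIfAbsent(category, c -> new HashMap<>());
            for (int i = 0; i + 1 < text.length(); i++) {
                counts.merge(text.substring(i, i + 2), 1, Integer::sum);
                totalBigrams.merge(category, 1, Integer::sum);
            }
        }

        /** Average negative log2 probability per bigram (lower is better). */
        private double crossEntropyRate(String category, String text) {
            Map<String, Integer> counts = bigramCounts.get(category);
            int total = totalBigrams.getOrDefault(category, 0);
            int vocab = counts.size() + 1;               // +1 reserves mass for unseen bigrams
            double logProbSum = 0.0;
            int n = 0;
            for (int i = 0; i + 1 < text.length(); i++) {
                int c = counts.getOrDefault(text.substring(i, i + 2), 0);
                double p = (c + 1.0) / (total + vocab);  // add-one smoothed estimate
                logProbSum += Math.log(p) / Math.log(2);
                n++;
            }
            return n == 0 ? Double.POSITIVE_INFINITY : -logProbSum / n;
        }

        /** Classify by choosing the category whose model gives the lowest cross-entropy rate. */
        public String classify(String text) {
            String best = null;
            double bestRate = Double.POSITIVE_INFINITY;
            for (String category : bigramCounts.keySet()) {
                double rate = crossEntropyRate(category, text);
                if (rate < bestRate) {
                    bestRate = rate;
                    best = category;
                }
            }
            return best;
        }
    }

Normalizing by the number of bigrams gives a per-symbol rate rather than a total log probability, which is what makes scores comparable across documents of different lengths.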
KNN Classifier

We also tested LingPipe's KNN classifier on the newsgroup classification problem. Without filtering, this classifier performs poorly. This is a consequence of the similarity measure used, in which documents are converted to vectors of boolean variables indicating whether or not a given feature is present. As with multivariate Naïve Bayes, the effect is that all features are treated equally, and the classifier performs poorly on larger documents.

Feature Selection with KL Divergence

We implemented the KL divergence-based feature selector described in [2] and tested it with our classifier implementations. The results shown in the table below are as expected: as the author claims, the KL divergence-based feature selection method (referred to as KL from now on) performs as well as the chi-squared approach (Chi2). However, we do not expect this to always be the case. Chi2 guarantees that the set of retained features F contains the top N features from every class (i.e., each class is equally represented in the set F). This is possible because the feature score (the chi-squared statistic) is a function of both the feature and the class, allowing the best features within a class to be identified. The score in KL, on the other hand, is a function of the feature alone, so it is not possible to select the top N features within a particular class. One risk of using KL is therefore that a particular class may be over- or underrepresented in the set of top features. If one class accounts for a disproportionately large share of high-scoring KL words, the set of top features will contain a disproportionate number of words from that class. Should this happen, the weaker signals present in the other classes will be inadvertently excluded from the training data.

Results: Classifier Accuracy Comparison

                                                      Feature Selection Method (number of features)
Classifier                                            None          Chi-2 (5613*)   KL (5000)     KL (10000)    Deliverable #
Naïve Bayes (multivariate)                            0.83927804    0.85209885      0.81661972    0.83333228    1
Naïve Bayes (multinomial)                             0.93500140    0.88252699      0.88021420    0.90783690    3
Naïve Bayes (TWCNB)                                   0.94975185    0.85174360      0.84054126    0.87191292    5
Naïve Bayes (with domain-specific feature selection)                                                            6
SVM (radial basis kernel)                             0.96940279    0.95426271      0.96144794    0.96436306    7
SVM (linear kernel)                                   0.95718356    0.90905059      0.93589140    0.94576513    7
Language model (LingPipe)                             0.98562993    N/A             N/A           N/A           8
KNN (LingPipe, K=3)                                   0.82575208    0.84297618      0.85628772    0.84332968    8

This table shows 10-fold cross-validation accuracies for various combinations of classifiers and feature selectors.
*This is the size of the set containing the top 300 features from each class.

References

[1] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
[2] K. Schneider. A new feature selection score for multinomial naive Bayes text classification based on KL-divergence. In The Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, July 2004.
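As a supplement to the Feature Selection with KL Divergence section above, the sketch below shows one way to compute a per-feature KL-divergence score and retain a single global set of top features, which is exactly the property discussed there: the score depends on the feature alone, not on the class. The score shown, KL(P(c|w) || P(c)), is a representative choice in the spirit of [2], not necessarily the paper's exact formula, and the count representation (termClassCounts, classCounts) is an assumption made for illustration.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.LinkedHashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.stream.Collectors;

    /**
     * Sketch of per-feature KL-divergence feature selection:
     * score(w) = sum over classes c of P(c|w) * log(P(c|w) / P(c)),
     * i.e. how far the class distribution of occurrences of w departs
     * from the overall class prior. The score does not depend on any
     * single class, so the top features are selected globally.
     */
    public class KlFeatureSelector {

        /**
         * @param termClassCounts termClassCounts.get(w)[c] = occurrences of term w in class c
         * @param classCounts     classCounts[c] = total term occurrences in class c
         * @param numFeatures     number of features to retain
         */
        public static Set<String> selectTopFeatures(Map<String, long[]> termClassCounts,
                                                    long[] classCounts,
                                                    int numFeatures) {
            long total = Arrays.stream(classCounts).sum();
            Map<String, Double> score = new HashMap<>();

            for (Map.Entry<String, long[]> entry : termClassCounts.entrySet()) {
                long[] counts = entry.getValue();
                long termTotal = Arrays.stream(counts).sum();
                if (termTotal == 0) continue;
                double kl = 0.0;
                for (int c = 0; c < counts.length; c++) {
                    if (counts[c] == 0) continue;                  // 0 * log(0 / x) contributes nothing
                    double pCGivenW = (double) counts[c] / termTotal;
                    double pC = (double) classCounts[c] / total;   // class prior
                    kl += pCGivenW * Math.log(pCGivenW / pC);
                }
                score.put(entry.getKey(), kl);
            }

            // Keep the numFeatures highest-scoring terms, regardless of which class they favor.
            return score.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                    .limit(numFeatures)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toCollection(LinkedHashSet::new));
        }
    }

Because nothing in this scoring constrains per-class representation, a class whose vocabulary happens to score highly can crowd out features from the other classes, which is the failure mode described above.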