CS276: Programming Assignment 2
Richard Frankel, Tim Harrington
frankel@cs.stanford.edu, taharrin@stanford.edu

Classification Results

The following table shows the 10-fold cross-validation accuracies we observed for the different combinations of classifiers and feature selection methods.

                                             Feature Selection Method (number of features kept)
Deliverable #   Classifier                   None         Chi-2 (5613*)   KL (5000)    KL (10000)
1               Naïve Bayes (multivariate)   0.83333228   0.83927804      0.93500140   0.85209885
3               Naïve Bayes (multinomial)    0.90783690   0.88252699      0.81661972   0.88021420
5               Naïve Bayes (CNB)
5               Naïve Bayes (WCNB)
5               Naïve Bayes (TWCNB)
7               SVM (radial basis kernel)    0.96940279   0.95426271      0.96144794   0.96436306
7               SVM (linear kernel)          0.95718356   0.90905059      0.93589140   0.94576513
8               Language model (LingPipe)    0.98562993   N/A             N/A          N/A
8               KNN (LingPipe, K=3)          0.82575208   0.84297618      0.85628772   0.84332968

*This is the size of the set containing the top 300 features from each class.

Naïve Bayes Classification

While the multivariate (Bernoulli) model and the multinomial model are both valid models to use with a Naïve Bayes classifier, they differ in ways that affect classification accuracy. One important difference is that the multinomial model keeps track of how many times each word occurs in a document, while the multivariate model records only whether a word occurs. In the multivariate case this means all features are weighted equally. Since the documents in the 20 newsgroups dataset are relatively long, this equal weighting of features leads to poorer classification accuracy than multinomial Naïve Bayes, which weights each feature's contribution by its count. As shown in the results section, this is reflected in the classification accuracies we observed for the two methods.

Chi-squared Feature Selection

Chi-squared feature selection uses the chi-squared sample statistic to measure the lack of independence between a feature and a class. The higher the value, the better a discriminator the feature is for that class. For some classification methods, applying chi-squared feature selection improves classification accuracy, while for others it worsens it. Classifiers that treat all features equally (e.g. multivariate Naïve Bayes or KNN) tend to perform better when chi-squared feature selection is used, because only important features (by the chi-squared measure) are retained, preventing the bad features (which were previously given equal weight) from influencing classification results. On the other hand, chi-squared feature selection worsens the accuracy of classifiers that already treat some features as more important than others. Two examples are SVM and multinomial Naïve Bayes: with SVM, only the training examples at the margin between the classes (the support vectors) determine the decision boundary, and with multinomial Naïve Bayes, a feature's contribution to the classification score is proportional to how many times it appears in the corpus. Classifiers like these glean signal from all the features but weight each feature's contribution by some measure of the quality of the information it bears. Using feature selection to reduce the number of features therefore gives these classifiers less information to work with, and their classification accuracies suffer. There is still a benefit, though: when computational costs are high and some information must be discarded or ignored, chi-squared feature selection provides a way to identify the features that are safest to ignore (i.e. those whose removal will harm accuracy the least).
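For concreteness, the sketch below shows one way to compute the chi-squared score for a single term/class pair from document counts and to merge the top-scoring terms of each class into a single feature set. This is a minimal illustration rather than our actual feature selection code, and the class and method names (ChiSquaredSketch, selectTopFeatures) are ours.

    import java.util.*;

    /** Illustrative sketch of chi-squared feature selection (not the code used for the reported results). */
    public class ChiSquaredSketch {

        /**
         * Chi-squared statistic for one (term, class) pair from document counts:
         * a = docs in the class containing the term, b = docs outside the class containing the term,
         * c = docs in the class missing the term,   d = docs outside the class missing the term.
         */
        public static double chiSquared(double a, double b, double c, double d) {
            double n = a + b + c + d;
            double numerator = n * Math.pow(a * d - b * c, 2);
            double denominator = (a + b) * (c + d) * (a + c) * (b + d);
            return denominator == 0 ? 0 : numerator / denominator;
        }

        /** Union of the top-k terms per class, ranked by chi-squared score. */
        public static Set<String> selectTopFeatures(Map<String, Map<String, Double>> scoresByClass, int k) {
            Set<String> selected = new HashSet<>();
            for (Map<String, Double> termScores : scoresByClass.values()) {
                termScores.entrySet().stream()
                        .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                        .limit(k)
                        .forEach(e -> selected.add(e.getKey()));
            }
            return selected;
        }
    }

With the top 300 terms kept per class, terms selected for more than one class overlap, which is why the merged set contains 5613 features rather than 20 x 300 = 6000.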
K-fold Cross Validation

K-fold cross-validation is a technique for assessing how well classifier performance will generalize to other data sets. In other words, it provides a way to estimate how well the classifier will perform in practice, and it is a useful method for testing hypotheses suggested by the data. The accuracies it produces are generally lower than those obtained by training on all the data and then re-using the same data for testing. Intuitively, this makes sense: testing the classifier on previously seen training examples will yield higher accuracy than testing on never-before-seen examples.

We implemented an honest k-fold cross-validation class. To compute the folds, each newsgroup is first split into K chunks of sizes as equal as possible (e.g. with K=3, a group of 11 documents is split into chunks of size 4, 4, and 3). The chunks in each group are then numbered from 1 to K. To construct the training set for fold j, the chunks with numbers other than j are merged across all newsgroups; to construct the test set for fold j, only the chunks numbered j are merged.

Transformed Weight-normalized Complement Naïve Bayes

We implemented TWCNB in three layers on top of our multinomial Naïve Bayes classifier. First, we implemented Complement Naïve Bayes (CNB) using the suggested values for the smoothing parameters α_i and α. This achieved the following results under 10-fold cross-validation, a large improvement over multinomial NB:

<results>

Next, we implemented Weight-normalized Complement Naïve Bayes (WCNB), which gave a further minor improvement:

<results>

Finally, we implemented Transformed Weight-normalized Complement Naïve Bayes (TWCNB), which actually resulted in a slight decrease in performance:

<results>

We are not certain why TWCNB performed slightly worse than WCNB.

Domain Specific Techniques

We first implemented the following domain-specific enhancements at the parsing level:

1. Mapping of numbers to a special NUM token.
2. Mapping of email addresses to a special EMAIL token.
3. Mapping of hyperlinks to a special LINK token.
4. Mapping of prices (dollars, pounds, euros, and yen) to a special PRICE token.
5. Skipping of quoted lines (lines beginning with '>'), to avoid double-counting quoted replies.
6. Upweighting of subject lines.

Interestingly, these techniques actually worsened the multinomial classifier's accuracy. Using only enhancements 1 and 4 produced the results below, which suggests that the numbers and prices present in the 20 newsgroups data set are useful for discriminating between classes.

[Figure: 10-fold cross-validation accuracy of Naïve Bayes (multinomial) vs. Naïve Bayes (multinomial w/ enhancements 1 & 4) for each feature selection setting: None, Chi-2 (5613*), KL (5000), KL (10000).]

Using only enhancements 5 and 6 yielded very similar results.

[Figure: 10-fold cross-validation accuracy of Naïve Bayes (multinomial) vs. Naïve Bayes (multinomial w/ enhancements 5 & 6) for each feature selection setting: None, Chi-2 (5613*), KL (5000), KL (10000).]

Our theory is that quoted lines (those starting with ">") effectively upweight the important words of a message: in a long thread, the starting message (i.e. the topic) is usually repeated in every reply, so skipping quotes discards that signal. We also tried several other combinations of enhancements 1-6, but none resulted in a performance increase.
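To make the parsing-level mappings concrete, the sketch below shows one way enhancements 1-5 could be applied to a raw message. The regular expressions and the MessagePreprocessorSketch class are simplified stand-ins for illustration, not the exact patterns or code we used.

    import java.util.regex.Pattern;

    /** Simplified sketch of the parsing-level mappings (enhancements 1-5); not the exact patterns we used. */
    public class MessagePreprocessorSketch {

        private static final Pattern EMAIL = Pattern.compile("\\S+@\\S+\\.\\S+");
        private static final Pattern LINK  = Pattern.compile("(https?://|www\\.)\\S+");
        private static final Pattern PRICE = Pattern.compile("[$£€¥]\\s?\\d+(\\.\\d+)?");
        private static final Pattern NUM   = Pattern.compile("\\d+(\\.\\d+)?");

        public static String preprocess(String message) {
            StringBuilder out = new StringBuilder();
            for (String line : message.split("\n")) {
                if (line.startsWith(">")) {
                    continue; // enhancement 5: skip quoted reply lines
                }
                line = EMAIL.matcher(line).replaceAll(" EMAIL ");  // enhancement 2
                line = LINK.matcher(line).replaceAll(" LINK ");    // enhancement 3
                line = PRICE.matcher(line).replaceAll(" PRICE ");  // enhancement 4
                line = NUM.matcher(line).replaceAll(" NUM ");      // enhancement 1
                out.append(line).append('\n');
            }
            return out.toString();
        }
    }

Note that prices are mapped before bare numbers; otherwise "$20" would be rewritten as "$ NUM" instead of "PRICE".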
Support Vector Machine Classification

We used LIBSVM to build an SVM classifier for the newsgroup data. Since this is a multiclass problem, we used the one-against-all approach described in [1]. We experimented with three kernel types: linear, polynomial, and radial basis (i.e. "Gaussian"). The polynomial kernel (with default parameters) performed poorly. There may be better parameter choices, but we were not able to systematically explore the polynomial kernel's parameter space to find them (there is no tool analogous to grid.py for the polynomial kernel). The parameter-less linear kernel performed quite well during testing but was outperformed by a radial basis kernel with tuned parameters.

To determine the radial basis parameters we used the grid.py tool provided with LIBSVM. Treating the radial basis parameters as variables and 5-fold cross-validation accuracy as the objective function, grid.py searches for the parameter values that maximize classifier accuracy. Applying this process to training data from the 20 newsgroups data set showed that gamma=0.5 and C=8 are the best radial basis parameters for this particular classification problem.

With these kernel parameters in hand, we wrote a pure-Java implementation of an SVM classifier for the 20 newsgroups data set. During training, it converts each message to a normalized tf-idf vector and then builds 20 one-against-all models in memory. The training, testing, and model data can be saved to disk by using the appropriate constructor in the SVMClassifier class. See the results table for a summary of the SVM classifier's performance compared with the other classifiers we tested.
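For reference, training a single one-against-all model with the LIBSVM Java API and the tuned radial basis parameters looks roughly like the sketch below. This is a minimal sketch rather than our SVMClassifier implementation: it assumes the normalized tf-idf vectors have already been built as sparse svm_node arrays, and the full classifier repeats this step once per newsgroup.

    import libsvm.*;

    /** Minimal sketch of training one one-against-all RBF model with the LIBSVM Java API. */
    public class OneVsAllSketch {

        /** vectors[i] is a sparse, normalized tf-idf vector; positive[i] marks membership in the target class. */
        public static svm_model train(svm_node[][] vectors, boolean[] positive) {
            svm_problem prob = new svm_problem();
            prob.l = vectors.length;
            prob.x = vectors;
            prob.y = new double[prob.l];
            for (int i = 0; i < prob.l; i++) {
                prob.y[i] = positive[i] ? +1 : -1; // one-against-all labeling
            }

            svm_parameter param = new svm_parameter();
            param.svm_type = svm_parameter.C_SVC;
            param.kernel_type = svm_parameter.RBF;
            param.gamma = 0.5;       // found with grid.py
            param.C = 8;             // found with grid.py
            param.cache_size = 100;  // kernel cache in MB
            param.eps = 1e-3;        // stopping tolerance

            return svm.svm_train(prob, param);
        }

        public static double predict(svm_model model, svm_node[] vector) {
            return svm.svm_predict(model, vector); // +1 if the message is assigned to the target class
        }
    }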
Other Classifiers

Language Model Classifier

LingPipe implements a sophisticated language model classifier that constructs a process language model per category and a multivariate distribution over the categories. Scores are computed as adjusted sample cross-entropy rates, which allows scores to be compared across documents. We built a wrapper class called LanguageModelClassifier that uses LingPipe's language model classifier to train on and classify newsgroup messages. Since word position affects language model construction, we did not combine this classifier with any feature selection methods. Running 10-fold cross-validation with this classifier yielded impressive results: at 98.6%, it has the highest accuracy of all the classifiers we tested.

KNN Classifier

We also tested LingPipe's KNN classifier on the newsgroup classification problem. Without feature selection, this classifier performs poorly. This is a consequence of the similarity measure used, in which each document is converted to a vector of boolean variables indicating whether a given feature is present. All features are therefore treated equally, and the classifier performs poorly when only a subset of the features is useful for discriminating between classes, as described in more detail in our discussion of chi-squared feature selection. As shown in the results section, the accuracy of the KNN classifier is highest when only a relatively small number of discriminating features is kept.

Feature Selection with KL Divergence

We implemented the KL divergence-based feature selector described in [2] and tested it with our classifier implementations. As the author claims, the KL divergence-based feature selection method (referred to as KL from now on) performs about as well as feature selection based on the chi-squared statistic (Chi2). However, we do not expect this to always be the case.

Chi2 guarantees that the set of retained features F contains the N top features from every class (i.e. each class is equally represented in F). This is possible because the chi-squared score is a function of both the feature and the class, allowing the best features within each class to be identified. The KL score, on the other hand, is a function of the feature only, so it is not possible to select the top N features within a particular class. One risk in using KL is therefore that a particular class may be over- or under-represented in the set of top features. If one class contains a larger proportion of high-scoring KL words, the set of top features will include a disproportionate number of words from that class, and the weaker signals present in the other classes will be inadvertently excluded from the training data.

References

[1] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
[2] K. Schneider. A new feature selection score for multinomial naive Bayes text classification based on KL-divergence. In The Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, July 2004.
[3] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.