Naïve Bayes Classification

CS276: Programming Assignment 2
Richard Frankel, Tim Harrington
frankel@cs.stanford.edu, taharrin@stanford.edu
Classification Results
The following table shows the 10-fold cross-validation accuracies we observed for the different
combinations of classifiers and feature selection methods.
                                             Feature Selection Method (number of features kept)
Deliverable #  Classifier                    None        Chi-2 (5613*)  KL (5000)   KL (10000)
1              Naïve Bayes (multivariate)    0.83927804  0.95426271     0.96144794  0.96436306
3              Naïve Bayes (multinomial)     0.93500140  0.90905059     0.93589140  0.94576513
5              Naïve Bayes (CNB)
5              Naïve Bayes (WCNB)
5              Naïve Bayes (TWCNB)
7              SVM (radial basis kernel)     0.96940279  0.85209885     0.81661972  0.83333228
7              SVM (linear kernel)           0.95718356  0.88252699     0.88021420  0.90783690
8              Language model (LingPipe)     0.98562993  N/A            N/A         N/A
8              KNN (LingPipe, K=3)           0.82575208  0.84297618     0.85628772  0.84332968
*This is the size of the set containing the top 300 features from each class.
Naïve Bayes Classification
While the multivariate (Bernoulli) model and the multinomial model are both valid models to use
with a Naïve Bayes classifier, they differ in ways that affect classification accuracy. One
important difference is that the multinomial model keeps track of the number of times each word
occurs in a document, while the multivariate model records only whether a word occurs at all. In
the multivariate case this means every feature contributes equally to the score. Because documents
in the 20 newsgroups dataset are relatively long, this equal weighting leads to poor classification
accuracy compared with multinomial Naïve Bayes, which weights each feature's contribution by its
count. As shown in the results section, this is reflected in the classification accuracies we
observed for the two methods.
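To make the contrast concrete, the following is a minimal sketch of how the two models score a
document for a single class. The class and method names are illustrative rather than part of our
actual implementation; the log-probabilities are assumed to be smoothed estimates produced during
training.

import java.util.Map;
import java.util.Set;

// Sketch only: contrast how the multinomial and multivariate (Bernoulli) models
// score one class. The log-probabilities are smoothed estimates from training.
final class NaiveBayesScoringSketch {

    // Multinomial: each term contributes once per occurrence in the document.
    static double multinomialLogScore(Map<String, Integer> termCounts, double logPrior,
                                      Map<String, Double> logPTermGivenClass) {
        double score = logPrior;
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            Double logP = logPTermGivenClass.get(e.getKey());
            if (logP != null) {
                score += e.getValue() * logP; // weighted by term frequency
            }
        }
        return score;
    }

    // Bernoulli: every vocabulary term contributes exactly once, present or absent,
    // so all features carry equal weight regardless of how often they occur.
    static double bernoulliLogScore(Set<String> presentTerms, double logPrior,
                                    Map<String, Double> logPPresent, Map<String, Double> logPAbsent) {
        double score = logPrior;
        for (String term : logPPresent.keySet()) {
            score += presentTerms.contains(term) ? logPPresent.get(term) : logPAbsent.get(term);
        }
        return score;
    }
}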
Chi-squared Feature Selection
Chi-squared feature selection uses the chi-squared sample statistic to measure the lack of
independence between a feature and a class. The higher the value, the better discriminator the
feature is for the class. For some classification methods, applying chi-squared feature selection
improves classification accuracy, while for others it hurts it. For example, classifiers that treat
all features equally (e.g. multivariate Naïve Bayes or KNN) tend to perform better when chi-squared
feature selection is used, because only the important features (by the chi-squared measure) are
retained and the uninformative features, which were previously given equal weight, can no longer
influence the classification results.
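For reference, the chi-squared score for a (feature, class) pair can be computed from a 2x2
contingency table of document counts, and the retained set is the union of the top-scoring features
of each class (keeping the top 300 per class is what produced the 5613-feature set in the results
table). The sketch below uses the standard formula; the class and names are illustrative, not our
actual code.

import java.util.*;

// Chi-squared score for one (feature, class) pair from a 2x2 contingency table of
// document counts, using the usual N11/N10/N01/N00 convention
// (e.g. n11 = documents in the class that contain the feature).
final class ChiSquaredSketch {

    static double chiSquared(double n11, double n10, double n01, double n00) {
        double n = n11 + n10 + n01 + n00;
        double num = n * Math.pow(n11 * n00 - n10 * n01, 2);
        double den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00);
        return den == 0 ? 0 : num / den;
    }

    // Keep the union of the top-k features of every class (k = 300 gave us 5613 features).
    static Set<String> topFeaturesPerClass(Map<String, Map<String, Double>> scoresByClass, int k) {
        Set<String> kept = new HashSet<>();
        for (Map<String, Double> scores : scoresByClass.values()) {
            scores.entrySet().stream()
                  .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                  .limit(k)
                  .forEach(e -> kept.add(e.getKey()));
        }
        return kept;
    }
}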
On the other hand, chi-squared feature selection worsens the accuracy of classifiers that already
treat some features as more important than others. Two examples of such classifiers are SVM and
multinomial Naïve Bayes. With SVM, the decision boundary is determined only by the training
examples at the margin between the classes (the support vectors), and the learned model weights
features unequally; with multinomial Naïve Bayes, a feature's contribution to the classification
score is proportional to how many times it appears in the document, with per-class term
probabilities estimated from the corpus. Classifiers like these glean signal from all the features,
but weigh each feature's contribution by some measure of the quality of the information it bears.
As such, using feature selection to reduce the number of features necessarily gives these
classifiers less information to work with and, hence, their classification accuracy suffers. There
is still a benefit to doing so: when computational costs are high and some information must be
discarded or ignored, chi-squared feature selection provides a way to identify the features that
are safest to ignore (i.e. the ones that will harm accuracy the least).
K-fold Cross Validation
K-fold cross-validation is a technique used to assess how well classifier performance will
generalize to other data sets. In other words, it provides a way to estimate how well the classifier
will perform in practice, and it is a useful method for testing hypotheses suggested by the data.
The accuracy it reports is generally lower than what is obtained by training on all the data and
then re-using that same data for testing. Intuitively, it makes sense that testing the classifier
on previously seen training examples will result in higher accuracy than testing on
never-before-seen examples.
We implemented an honest k-fold cross-validation class. To compute the folds, each newsgroup was
first split into k chunks of as equal a size as possible (e.g. with k=3, a group of 11 documents is
split into chunks of size 4, 4, and 3). The chunks in each group were then numbered from 1 to k. To
construct the training set for fold j, the chunks whose number is not equal to j were merged across
all newsgroups; the test set for fold j consists of the remaining chunks, i.e. chunk j from every
newsgroup.
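A minimal sketch of that fold construction (generic types and illustrative names, not our actual
class):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stratified k-fold splitting: each newsgroup is chunked separately so that every
// fold contains a roughly proportional number of documents from every group.
final class KFoldSketch {

    // Split one newsgroup's documents into k chunks of near-equal size
    // (e.g. 11 documents with k=3 gives chunks of size 4, 4, 3).
    static <T> List<List<T>> chunk(List<T> groupDocs, int k) {
        List<List<T>> chunks = new ArrayList<>();
        int base = groupDocs.size() / k, extra = groupDocs.size() % k, start = 0;
        for (int i = 0; i < k; i++) {
            int size = base + (i < extra ? 1 : 0);
            chunks.add(new ArrayList<>(groupDocs.subList(start, start + size)));
            start += size;
        }
        return chunks;
    }

    // Returns a two-element list: index 0 is the training set, index 1 the test set for fold j.
    static <T> List<List<T>> fold(Map<String, List<T>> docsByGroup, int k, int j) {
        List<T> train = new ArrayList<>();
        List<T> test = new ArrayList<>();
        for (List<T> groupDocs : docsByGroup.values()) {
            List<List<T>> chunks = chunk(groupDocs, k);
            for (int i = 0; i < k; i++) {
                (i == j ? test : train).addAll(chunks.get(i));
            }
        }
        return List.of(train, test);
    }
}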
Transformed Weight-normalized Complement Naïve Bayes
We implemented TWCNB in three layers on top of our Multinomial Naïve Bayes classifier.
First, we implemented Complement Naïve Bayes using the suggested values for the smoothing parameters αi and α. This
achieved the following results using 10-Fold Cross-Validation. There was a large improvement
here over Multinomial NB.
<results> Tim: How do you print the results by class?
Next, we implemented Weight-normalized Complement Naïve Bayes, with the following results,
and got a minor improvement:
<results>
Finally, we implemented Transformed Weight-normalized Complement Naïve Bayes, which
actually resulted in a slight decrease in performance:
<results>
…again, not sure what else to say here. Not really sure why TWCNB was worse than WCNB –
do you have any ideas?
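For reference, the sketch below shows the complement-count weight estimates and the weight
normalization step as they are usually defined (αi is the per-term smoothing value and α its sum
over the vocabulary); it is illustrative rather than a copy of our classes. Roughly, a document is
then assigned to the class whose complement weights give the smallest dot product with its term
counts.

// Sketch of the CNB/WCNB weight estimation. counts[c][i] is the total count of term i
// in training documents of class c; alphaI is the per-term smoothing value, and
// alpha = alphaI * |vocabulary| is its sum.
final class ComplementNBSketch {

    // w[c][i] = log( (alphaI + count of term i in every class EXCEPT c)
    //              / (alpha  + total count of all terms outside c) )
    static double[][] complementWeights(double[][] counts, double alphaI) {
        int numClasses = counts.length, numTerms = counts[0].length;
        double alpha = alphaI * numTerms;
        double[][] w = new double[numClasses][numTerms];
        for (int c = 0; c < numClasses; c++) {
            double totalOutside = 0;
            double[] termOutside = new double[numTerms];
            for (int other = 0; other < numClasses; other++) {
                if (other == c) continue;
                for (int i = 0; i < numTerms; i++) {
                    termOutside[i] += counts[other][i];
                    totalOutside += counts[other][i];
                }
            }
            for (int i = 0; i < numTerms; i++) {
                w[c][i] = Math.log((alphaI + termOutside[i]) / (alpha + totalOutside));
            }
        }
        return w;
    }

    // WCNB: normalize each class's weight vector by the sum of its absolute weights.
    static void weightNormalize(double[][] w) {
        for (double[] wc : w) {
            double norm = 0;
            for (double v : wc) norm += Math.abs(v);
            for (int i = 0; i < wc.length; i++) {
                wc[i] /= norm;
            }
        }
    }
}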
Domain Specific Techniques
We first implemented the following domain-specific enhancements at the parsing level.
1. Mapping of numbers to a special NUM token
2. Mapping of email addresses to a special EMAIL token
3. Mapping of hyperlinks to a special LINK token
4. Mapping of prices (dollars, pounds, Euros, and yen) to a special PRICE token
5. Skipping of quote lines (lines beginning with ‘>’). Our intent here was to avoid the
double-counting of quoted replies.
6. Upweighting of subject lines.
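A rough sketch of how mappings 1-4 and the quote-line skipping can be applied at the parsing level
(the exact regular expressions we used are not reproduced here; these are illustrative):

import java.util.regex.Pattern;

// Illustrative token mappings. Order matters: prices and emails are replaced
// before bare numbers so that "$20" becomes PRICE rather than "$" followed by NUM.
final class TokenMappingSketch {
    private static final Pattern EMAIL  = Pattern.compile("\\b[\\w.+-]+@[\\w.-]+\\.[A-Za-z]{2,}\\b");
    private static final Pattern LINK   = Pattern.compile("\\bhttps?://\\S+|\\bwww\\.\\S+");
    private static final Pattern PRICE  = Pattern.compile("[$£€¥]\\s?\\d[\\d,]*(\\.\\d+)?");
    private static final Pattern NUMBER = Pattern.compile("\\b\\d+(\\.\\d+)?\\b");

    static String normalize(String line) {
        if (line.startsWith(">")) {
            return "";                                   // enhancement 5: skip quoted lines
        }
        line = EMAIL.matcher(line).replaceAll(" EMAIL ");
        line = LINK.matcher(line).replaceAll(" LINK ");
        line = PRICE.matcher(line).replaceAll(" PRICE ");
        line = NUMBER.matcher(line).replaceAll(" NUM ");
        return line;
    }
}

One simple way to implement enhancement 6 is then to add the subject-line tokens to the document
more than once.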
Accuracy
Interestingly, the above techniques actually worsened the multinomial classifier accuracy.
Simply using enhancements 1 and 4 produced the following results, which suggests that the
numbers and prices present in the 20 newsgroups data set are useful for discriminating between
classes.
[Chart: 10-fold cross-validation accuracy of Naïve Bayes (multinomial) versus Naïve Bayes
(multinomial w/ enhancements 1 & 4) for each feature selection setting (None, Chi-2 (5613*),
KL (5000), KL (10000)).]
Using only enhancements 5 and 6 yielded very similar results.
[Chart: 10-fold cross-validation accuracy of Naïve Bayes (multinomial) versus Naïve Bayes
(multinomial w/ enhancements 5 & 6) for each feature selection setting (None, Chi-2 (5613*),
KL (5000), KL (10000)).]
Our theory is that lines starting with ">" have the effect of upweighting the important words of a
message: in a long thread, the starting message (i.e. the topic) is usually repeated in every
reply, so skipping quoted lines discards that extra signal.
We also tried several other combinations of 1-6. However, none resulted in a performance
increase.
Support Vector Machine Classification
We used LIBSVM to build an SVM classifier for the newsgroup data. Since this is a multiclass
problem we used the one-against-all approach described in [1]. We experimented with three
kernel types: linear, polynomial, and radial basis (i.e. "Gaussian"). We found that the
polynomial kernel (with default parameters) performed poorly. There may be better parameter
choices, but we were not able to systematically explore the polynomial kernel's parameter space
to find them (there is no grid.py-style tuning tool for the polynomial kernel).
The parameter-less linear kernel performed quite well during testing but was outperformed by a
radial basis kernel with tuned parameters. To determine the radial basis parameters we used the
grid.py tool provided with LIBSVM: treating the kernel parameters as variables and 5-fold
cross-validation accuracy as the objective function, it searches a grid of (gamma, C) combinations
for the one that maximizes classifier accuracy. Applying this process to training data from the 20
newsgroups data set revealed gamma=0.5 and C=8 to be the best radial basis parameters for this
particular classification problem.
With these kernel parameters in hand, we wrote a pure-Java implementation of an
SVM classifier for the 20 newsgroups data set. During training, it converts each message to a
normalized tf-idf vector and then builds 20 one-against-all models in memory. It is possible to
save the training, testing and model data to the hard drive by using the appropriate constructor
method in the SVMClassifier class. See the results table for a summary of the SVM classifier's
performance compared with the other classifiers we tested.
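For reference, the sketch below shows how one of the 20 one-against-all models can be trained
through LIBSVM's Java API with the parameters found by grid.py. The surrounding names are
illustrative, and the tf-idf vectors are assumed to have already been built as sparse svm_node
arrays; the usual one-against-all decision rule then assigns a message to the class whose model
produces the largest decision value.

import libsvm.*;

// Sketch: train one of the 20 one-against-all models with the tuned RBF parameters.
final class OneVsAllSvmSketch {

    static svm_model trainOneVsAll(svm_node[][] tfidfVectors, int[] labels, int positiveClass) {
        svm_problem prob = new svm_problem();
        prob.l = tfidfVectors.length;
        prob.x = tfidfVectors;
        prob.y = new double[prob.l];
        for (int i = 0; i < prob.l; i++) {
            prob.y[i] = (labels[i] == positiveClass) ? +1 : -1;   // one class vs. the rest
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;
        param.gamma = 0.5;          // from grid.py
        param.C = 8;                // from grid.py
        param.cache_size = 100;
        param.eps = 1e-3;

        return svm.svm_train(prob, param);
    }
}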
Other Classifiers
Language Model Classifier
LingPipe implements a sophisticated language model classifier that constructs a process
language model per category and a multivariate distribution over the categories. Scores are
computed as adjusted sample cross-entropy rates, which allow between-document comparisons.
We built a wrapper class called LanguageModelClassifier that uses LingPipe's language model
classifier to train on and classify newsgroup messages. Since word position affects language
model construction, we did not combine this classifier with any feature selection methods.
Running 10-fold cross-validation with this classifier yielded impressive results. At 98.6%, it has
the highest accuracy of all the classifiers we tested.
KNN Classifier
We also tested LingPipe's KNN classifier on the newsgroup classification problem. Without
filtering, this classifier performs poorly. This is a consequence of the similarity measure used,
in which each document is converted to a vector of boolean variables that indicate whether a given
feature is present or not. All features are therefore treated equally, and the classifier performs
poorly when only a subset of the features is useful for discriminating between classes (this is
described in more detail in our discussion of chi-squared feature selection). As shown in the
results section, the accuracy of the KNN classifier is highest when only a relatively small number
of discriminating features is kept.
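The effect is easy to see in a sketch of a similarity measure over presence/absence vectors: a
rare, highly discriminative term counts exactly as much as a common, uninformative one
(illustrative code, not LingPipe's implementation).

import java.util.Set;

// Cosine similarity between two documents represented as boolean feature sets.
// Every shared feature contributes the same amount, informative or not.
final class BooleanSimilaritySketch {
    static double cosine(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0;
        }
        int shared = 0;
        for (String term : a) {
            if (b.contains(term)) {
                shared++;
            }
        }
        return shared / Math.sqrt((double) a.size() * b.size());
    }
}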
Feature Selection with KL Divergence
We implemented the KL divergence-based feature selector described in [2] and tested it with our
classifier implementations. As its author claims, the KL divergence-based feature selection method
(referred to as KL from now on) performs about as well as the chi-squared approach (Chi2). However,
we do not expect this to always be the case.
Chi2 guarantees that the set of retained features F contains the top N features from every class
(i.e. each class is equally represented in F). This is possible because the chi-squared score is a
function of both the feature and the class, allowing the best features within a class to be
identified. The KL score, on the other hand, is a function of the feature only, so it is not
possible to select the top N features within a particular class. One risk in using KL is therefore
that a particular class may be over- or under-represented in the set of top features. If one class
contains a larger proportion of high-scoring KL words, the set of top features will contain a
disproportionate number of words from that class, and the weaker signals present in the other
classes will be inadvertently excluded from the training data.
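To make the "function only of the feature" point concrete, here is one natural per-term KL score:
the divergence between the class distribution among documents containing the term and the overall
class prior. It yields a single number per term, with no per-class ranking; the exact score defined
in [2] differs in its details.

// Illustrative per-term KL score (not the exact formula from [2]).
// pClassGivenTerm[c] = P(c | t); pClass[c] = P(c); both distributions sum to 1.
final class KlScoreSketch {
    static double klScore(double[] pClassGivenTerm, double[] pClass) {
        double kl = 0;
        for (int c = 0; c < pClass.length; c++) {
            if (pClassGivenTerm[c] > 0) {
                kl += pClassGivenTerm[c] * Math.log(pClassGivenTerm[c] / pClass[c]);
            }
        }
        return kl;
    }
}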
References
[1] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines.
    IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
[2] K. Schneider. A new feature selection score for multinomial naive Bayes text classification
    based on KL-divergence. In The Companion Volume to the Proceedings of the 42nd Annual Meeting
    of the Association for Computational Linguistics, Barcelona, Spain, July 2004.
[3] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge
    University Press, 2008.