CS276: Programming Assignment 2
Richard Frankel, Tim Harrington
frankel@cs.stanford.edu, taharrin@stanford.edu
Naïve Bayes Classification
Discuss accuracy of Multivariate Naïve Bayes and Multinomial Naïve Bayes
The multivariate (Bernoulli) model and the multinomial model are both valid event models
to use with a Naïve Bayes classifier, but they differ in ways that affect classification
accuracy. The multinomial model keeps track of the number of times each word occurs in a
document, whereas the multivariate model keeps only a binary record of whether each word occurs.
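The difference between the two document representations can be sketched as follows. This is a minimal illustration only: the function names and the toy vocabulary are hypothetical and are not taken from the assignment code.

```python
from collections import Counter


def multinomial_features(tokens):
    """Multinomial view: map each word to its number of occurrences."""
    return dict(Counter(tokens))


def bernoulli_features(tokens, vocabulary):
    """Multivariate (Bernoulli) view: 1 if the word occurs at all, else 0.

    Every vocabulary word gets an entry, so absent words are recorded too.
    """
    present = set(tokens)
    return {word: int(word in present) for word in vocabulary}


# Toy document and vocabulary (hypothetical).
tokens = "the game the team won the game".split()
vocabulary = ["game", "team", "won", "the", "goal"]

counts = multinomial_features(tokens)
binary = bernoulli_features(tokens, vocabulary)
```

In the multinomial view the repeated words "the" and "game" carry more weight than "team"; in the Bernoulli view all four observed words look identical, and the absent word "goal" is recorded explicitly as 0.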
Chi-squared Feature Selection
Effects of and a discussion on chi-squared feature selection.
K-fold Cross Validation
Comment on the changes in accuracy due to and implications of applying k-fold cross-validation
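For reference, a generic k-fold procedure can be sketched as below. This is an assumption-laden illustration, not the assignment's implementation: the helper names are hypothetical, folds are contiguous (the data is assumed to be shuffled beforehand), and train_fn and predict_fn stand in for whatever classifier is being evaluated.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def cross_validate(items, labels, k, train_fn, predict_fn):
    """Return the mean accuracy over k folds."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Each item is used for testing exactly once and for training k-1 times, so the averaged accuracy is less sensitive to one lucky or unlucky train/test split.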
Transformed Weight-normalized Complement Naïve Bayes
Compare the Naïve Bayes classifier with the Transformed Weight-normalized Complement
Naïve Bayes (TWCNB) classifier.
Domain Specific Techniques
Results from applying domain-specific techniques/experimenting with different techniques
Support Vector Machine Classification
We used LIBSVM to build an SVM classifier for the newsgroup data. Since this is a multiclass
problem we used the one-against-all approach described in [1]. We experimented with three
kernel types: linear, polynomial and radial basis (i.e. the "Gaussian" kernel). We found that the
polynomial kernel (with default parameters) performed poorly. There may be better parameter
choices, but we weren't able to systematically explore the polynomial kernel parameter space
(i.e. there is no analogous grid.py tool for the polynomial kernel) to find them.
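For reference, the three kernel functions can be sketched in plain Python as below. The formulas follow the standard definitions as given in the LIBSVM documentation; the default values shown for degree and coef0 are assumed to match LIBSVM's documented defaults, and the function names are illustrative only.

```python
import math


def dot(u, v):
    """Inner product of two equal-length dense vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    """K(u, v) = u . v (no kernel parameters)."""
    return dot(u, v)


def polynomial_kernel(u, v, gamma, coef0=0.0, degree=3):
    """K(u, v) = (gamma * u . v + coef0) ** degree.

    degree=3 and coef0=0 are assumed defaults; gamma must be supplied.
    """
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2), the radial basis kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)
```

The polynomial kernel has three interacting parameters (gamma, coef0, degree) plus the cost C, which is part of why an unguided search over its parameter space is harder than the two-parameter search for the radial basis kernel.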
The parameter-less linear kernel performed quite well during testing but was outperformed by a
radial basis kernel with tuned parameters. To determine the radial basis parameters we used the
grid.py tool provided with LIBSVM. Treating the radial basis parameters as variables and 5-fold
cross-validation accuracy as the objective function, grid.py searches a grid of parameter values
for the pair that maximizes classifier accuracy. Applying this process to training data from the
20 newsgroups data set showed that gamma=0.5 and C=8 were the best radial basis parameters found
for this particular classification problem.
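The search can be sketched as an exhaustive loop over powers of two. This is not grid.py itself: the exponent ranges are chosen for illustration, and toy_cv_accuracy is a hypothetical stand-in for the real objective (training an SVM and measuring 5-fold cross-validation accuracy), constructed so that its peak sits at the values reported above.

```python
import itertools
import math


def grid_search(cv_accuracy, log2_c_values, log2_gamma_values):
    """Evaluate cv_accuracy(C, gamma) on a grid; return (accuracy, C, gamma)."""
    best = None
    for log2_c, log2_gamma in itertools.product(log2_c_values, log2_gamma_values):
        c, gamma = 2.0 ** log2_c, 2.0 ** log2_gamma
        accuracy = cv_accuracy(c, gamma)
        # Strict comparison keeps the first of any tied parameter pairs.
        if best is None or accuracy > best[0]:
            best = (accuracy, c, gamma)
    return best


def toy_cv_accuracy(c, gamma):
    """Hypothetical objective peaked at C=8 (2^3), gamma=0.5 (2^-1)."""
    return 1.0 / (1.0 + abs(math.log2(c) - 3) + abs(math.log2(gamma) + 1))


# Exponent grids (assumed for illustration): C in 2^-5..2^15, gamma in 2^3..2^-15.
best = grid_search(toy_cv_accuracy, range(-5, 16, 2), range(3, -16, -2))
```

With a real objective, each grid point costs one full cross-validation run, so the grid resolution trades search time against how closely the best parameters can be located.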
With knowledge of which kernel parameters to use, we implemented an SVM classifier for the
20 newsgroups data set. During training, it converts each message to a normalized tf-idf vector
and then builds 20 one-against-all models in memory. It is possible to save the training, testing
and model data to the hard drive by changing the SAVE_FOR_OFFLINE_USE member variable
to true. See the "10-fold classifier accuracies" table for a summary of the SVM classifier's
performance on the 20 newsgroups data set.
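The training-time transformation can be sketched as follows. The exact tf-idf variant is not stated above, so this sketch assumes raw term frequency times log(N/df) with L2 normalization; the one-against-all helpers are likewise illustrative, and all names are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(tokenized_docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df


def tfidf_vector(tokens, df, n_docs):
    """Sparse tf-idf vector (term -> weight), scaled to unit L2 length."""
    tf = Counter(tokens)
    # Terms never seen in training (df of 0) are dropped.
    weights = {t: count * math.log(n_docs / df[t])
               for t, count in tf.items() if df.get(t)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {t: 0.0 for t in weights}
    return {t: w / norm for t, w in weights.items()}


def one_vs_all_labels(labels, positive_class):
    """Relabel a multiclass problem as +1 (this class) vs -1 (all others)."""
    return [1 if label == positive_class else -1 for label in labels]


def predict_one_vs_all(decision_values):
    """Pick the class whose binary model reports the largest decision value."""
    return max(decision_values, key=decision_values.get)
```

With 20 newsgroups, one_vs_all_labels would be applied once per class to produce the 20 binary training problems, and predict_one_vs_all would combine the 20 resulting decision values for a test message.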
Other Classifiers
Language Model Classifier
LingPipe implements a sophisticated language model classifier that constructs a (process)
language model per category and a multivariate distribution over the categories. Scores are
computed as adjusted sample cross-entropy rates, which allow between-document comparisons.
We built a wrapper class called LanguageModelClassifier that uses LingPipe's language model
classifier to train on and classify newsgroup messages. Since word position affects language
model construction, we did not combine this classifier with any feature selection methods.
Running 10-fold cross-validation with this classifier yielded impressive results. At 98.6%, it has
the highest accuracy of all the classifiers we tested.
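The idea of scoring by cross-entropy rate can be illustrated with a deliberately simplified sketch. This is not LingPipe's implementation: it assumes a character unigram model with add-one smoothing, ignores the distribution over categories, and uses hypothetical class and function names with toy training strings.

```python
import math
from collections import Counter


class CharUnigramModel:
    """Character unigram language model with add-one smoothing (a toy)."""

    def __init__(self, text, alphabet_size=256):
        self.counts = Counter(text)
        self.total = len(text)
        self.alphabet_size = alphabet_size

    def log2_prob(self, ch):
        """Smoothed log2 probability of a single character."""
        return math.log2((self.counts[ch] + 1) / (self.total + self.alphabet_size))

    def cross_entropy_rate(self, text):
        """Average number of bits per character needed to encode text."""
        if not text:
            raise ValueError("text must be non-empty")
        return -sum(self.log2_prob(ch) for ch in text) / len(text)


def classify(text, models):
    """Return the category whose model gives the lowest cross-entropy rate."""
    return min(models, key=lambda category: models[category].cross_entropy_rate(text))


# Toy per-category training text (hypothetical).
models = {
    "hockey": CharUnigramModel("goal puck ice rink goalie"),
    "space": CharUnigramModel("orbit moon nasa launch rocket"),
}
predicted = classify("puck on ice", models)
```

Dividing by the text length is what makes scores comparable between documents of different sizes, which is the property noted above.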
KNN Classifier
We also tested LingPipe's KNN classifier on the newsgroup classification problem. Without
filtering, this classifier performs poorly. This is a consequence of the similarity measure used, in
which documents are converted to a vector of boolean variables that indicate whether a given
feature is present or not. As is the case with multivariate Naïve Bayes, the effect of this is that all
features are treated equally and the classifier performs poorly on larger documents.
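The sensitivity to document length can be illustrated with a small sketch. The exact distance that LingPipe applies is not stated above, so this assumes Euclidean distance between boolean presence vectors, which reduces to the square root of the number of features present in exactly one of the two documents. All names and data are hypothetical.

```python
import math
from collections import Counter


def boolean_distance(features_a, features_b):
    """Euclidean distance between boolean presence vectors.

    Each feature present in exactly one document contributes 1 to the
    squared distance, so the result is sqrt(|symmetric difference|).
    """
    return math.sqrt(len(set(features_a) ^ set(features_b)))


def knn_classify(query, training, k=3):
    """Majority vote among the k training documents nearest to query.

    training is a list of (feature_set, label) pairs.
    """
    neighbors = sorted(training, key=lambda item: boolean_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical).
training = [
    ({"puck", "ice"}, "hockey"),
    ({"goal", "ice"}, "hockey"),
    ({"orbit", "moon"}, "space"),
    ({"orbit", "nasa"}, "space"),
]
query = {"puck", "goal"}
predicted = knn_classify(query, training, k=3)

# A long on-topic document that contains both query words plus 48 others.
long_doc = {"puck", "goal"} | {f"w{i}" for i in range(48)}
```

Under this assumed distance, the long on-topic document is farther from the query than a short off-topic one, because every extra word counts as one full mismatch regardless of how informative it is.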
Feature Selection with KL Divergence
We implemented the KL divergence-based feature selector described in [2] and tested it with our
classifier implementations. The results shown in the table below are expected. As the author
claims, the KL divergence-based feature selection method (referred to as KL from now on)
described in [2] performs as well as the approach based on mutual information (Chi2). However,
we do not expect this to always be the case.
Chi2 guarantees that the set of retained features F contains N top features from every class (i.e.
each class will be equally represented in the set F). This can be done because the feature score
(i.e. mutual information) is a function of feature and class, allowing the best features within a
class to be identified. On the other hand, the score in KL is a function only of the feature. This means
that it is not possible to select the top N features within a particular class. Therefore one risk in
using KL is that a particular class may be overrepresented or underrepresented in the set of top
features. It could be the case that one of the classes contains a larger proportion of high-scoring
KL words, in which case the set of top features will contain a disproportionate number of
words from this class. Should this happen, the weaker signals present in the other classes
will be inadvertently excluded from the training data.
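The contrast between the two selection strategies can be sketched as follows. The scores are made-up numbers for illustration, not real Chi2 or KL values, and the function names are hypothetical.

```python
def top_n_per_class(scores_by_class, n):
    """Union of the n best-scoring features within each class.

    scores_by_class maps class -> {feature: score}, which requires a score
    that is a function of both feature and class.
    """
    selected = set()
    for scores in scores_by_class.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        selected.update(ranked[:n])
    return selected


def top_n_global(scores, n):
    """The n best-scoring features overall, for a feature-only score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n])


# Hypothetical scores: one class has much stronger signals than the other.
class_scores = {
    "hockey": {"puck": 9.0, "goalie": 8.0, "the": 0.1},
    "space": {"orbit": 3.0, "nasa": 2.5, "the": 0.1},
}
global_scores = {"puck": 9.0, "goalie": 8.0, "orbit": 3.0, "nasa": 2.5, "the": 0.1}

per_class = top_n_per_class(class_scores, 1)
overall = top_n_global(global_scores, 2)
```

Both selections keep two features, but the per-class selection keeps one from each class while the global ranking keeps only hockey words. Because the per-class result is a union, features shared between classes are counted once, which presumably explains why the top 300 features from each of 20 classes yield 5613 features rather than 6000.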
Results: Classifier Accuracy Comparison
Columns: Deliverable #; feature selection method (number of features): Chi-2 (5613*), KL (5000), KL (10000)
Rows (classifiers):
Naïve Bayes (multivariate)
Naïve Bayes (multinomial)
Naïve Bayes (TWCNB)
Naïve Bayes (with domain-specific feature selection)
SVM (radial basis kernel)
SVM (linear kernel)
Language model (LingPipe)
KNN (LingPipe, K=3)
This table shows 10-fold cross-validation accuracies for various combinations of classifiers and
feature selectors. *This is the size of the set containing the top 300 features from each class.
