Peer Review Version
CIS 730/732 Project
Text Classification with a Naïve Bayes Classifier
By: Esteban Guillen
Problem Statement / Objective
The problem addressed in this project was learning to classify (guess the category/group of)
unlabeled text documents. The approach was to take a large set of labeled text documents
(documents whose category/group is known) and build a naïve Bayes classifier from them.
The naïve Bayes classifier could then classify an unlabeled example based on the
information learned from the labeled examples.
Significance
This project is significant because of the vast amount of information available on the
internet. The ability to classify unlabeled documents would make it easier for researchers
to find material in a specific area, and search engines like Google could return better
results if information were better organized.
This project also illustrates how a simple model like naïve Bayes can lead to accurate
results.
Background
This problem is well documented. I got the idea for this project from Tom Mitchell's
Machine Learning book. In his book, Mitchell worked with a collection of 20,000 documents
drawn from 20 Usenet newsgroups. He used a naïve Bayes classifier to classify a subset of
those 20,000 documents and achieved about 89% classification accuracy. My goal was to
achieve similar results.
Methodology
As mentioned above, I used a naïve Bayes classifier to classify unlabeled documents. I
was able to download (from Mitchell's web site) the same 20,000 newsgroup documents
that Mitchell used in his book. A subset of those documents was used as the training set
to build the classifier, and the rest were used to test the classifier for accuracy.
The first step in building the naïve Bayes classifier was to build a vector of all the
relevant words found in the training set of documents. I defined relevant words as those
that occurred more than 5 times but fewer than 15,000 times in the training set. This
eliminates the very rare words as well as the very common words that occur in almost
every text document; I judged that these words would provide little information and were
best removed. After parsing the initial training set (18,000 of the 20,000 documents) I
was left with 38,500 distinct words. I will refer to this set of words (with no repeated
words) as the Vocabulary.
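A minimal sketch of this vocabulary-building step, written in Java to match the project
code, is shown below. The class name, file-reading details, tokenizer, and threshold
constants are assumptions for illustration and are not taken from NaiveBayes.java.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;
import java.util.stream.Collectors;

public class VocabularyBuilder {

    // Frequency thresholds described above; the constant names are illustrative
    static final int MIN_COUNT = 5;
    static final int MAX_COUNT = 15000;

    public static Set<String> buildVocabulary(Path trainingDir) throws IOException {
        // Collect all training documents under the training directory
        List<Path> documents;
        try (var paths = Files.walk(trainingDir)) {
            documents = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        // Count how often each word occurs across all training documents
        Map<String, Integer> counts = new HashMap<>();
        for (Path doc : documents) {
            String text = Files.readString(doc).toLowerCase();
            for (String word : text.split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }

        // Keep only words occurring more than 5 and fewer than 15,000 times
        Set<String> vocabulary = new HashSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > MIN_COUNT && entry.getValue() < MAX_COUNT) {
                vocabulary.add(entry.getKey());
            }
        }
        return vocabulary;
    }
}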
The next step was to calculate the probabilities used by the naïve Bayes classifier. This
was accomplished by first calculating the prior probability P(v) for each category, which
was 1/20 for each category. Next, for each word wk in the Vocabulary, I estimated the
probability of observing wk given that the category is v. This conditional probability
will be referred to as P(wk|v). Each document in each category was parsed and a hash
table was created for each category. The keys of the hash table were the words in the
Vocabulary, and the values were the number of times each word occurred (nk) across all
the documents in that category. For example, if the category was "guns" and the word gun
occurred 3 times in each of the 900 training documents, then nk would be 2700 for that
word. The total word count n (counting repeats) for each category was also calculated.
From these values we can calculate P(wk|v) with the following Laplace-smoothed estimate.
P(wk|v) = ( nk + 1 ) / ( n + |Vocabulary| )
To summarize the previous step: there are 20 P(v) values, each equal to 1/20, and each of
the 20 categories has 38,500 P(wk|v) values. Another hash table can be created for each
category, with the 38,500 Vocabulary words as keys and each word's P(wk|v) value as the
value. The classifier is now complete.
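The sketch below illustrates how these per-category counts and smoothed probabilities
could be computed for one category. The method signature and the choice to count only
Vocabulary words toward n are assumptions of this sketch, not necessarily what
NaiveBayes.java does.

import java.util.*;

public class NaiveBayesTrainer {

    /**
     * Computes the Laplace-smoothed estimates P(wk|v) = (nk + 1) / (n + |Vocabulary|)
     * for one category v, given the concatenated word list of its training documents.
     */
    public static Map<String, Double> wordProbabilities(List<String> categoryWords,
                                                        Set<String> vocabulary) {
        // nk: number of times each Vocabulary word occurs in this category's documents
        Map<String, Integer> nk = new HashMap<>();
        int n = 0; // total word count (with repeats) for this category
        for (String word : categoryWords) {
            if (vocabulary.contains(word)) {
                nk.merge(word, 1, Integer::sum);
                n++;
            }
        }

        // P(wk|v) for every Vocabulary word, including words never seen in this category
        Map<String, Double> probabilities = new HashMap<>();
        for (String word : vocabulary) {
            int count = nk.getOrDefault(word, 0);
            probabilities.put(word, (count + 1.0) / (n + vocabulary.size()));
        }
        return probabilities;
    }
}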
To classify an unlabeled example, the probabilities of its words under a given category
are looked up in that category's hash table and multiplied together, along with the prior.
The category that produces the highest value is the label/classification assigned to the
unlabeled example. Only the words found in the unlabeled example are looked up in the hash
tables. The following equation is used to classify an unlabeled example, where the product
runs over the words wk appearing in the example.
v = argmax_v ( P(v) Π P(wk|v) )
If a word in the unlabeled test example did not exist in the Vocabulary built from the
training examples, that word was ignored.
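The sketch below illustrates this classification step. It sums log-probabilities instead
of multiplying the raw values, since the product of many small probabilities underflows
double precision; the data structures simply mirror the hash tables described above, and
the names are illustrative assumptions.

import java.util.*;

public class NaiveBayesClassifier {

    /**
     * Returns the category v maximizing P(v) * Π P(wk|v) over the words of the document,
     * computed in log space. Words not in the Vocabulary are ignored.
     */
    public static String classify(List<String> documentWords,
                                  Map<String, Double> priors,                   // P(v) per category
                                  Map<String, Map<String, Double>> wordProbs) { // P(wk|v) per category
        String bestCategory = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : priors.keySet()) {
            double score = Math.log(priors.get(category));
            Map<String, Double> probs = wordProbs.get(category);
            for (String word : documentWords) {
                Double p = probs.get(word);
                if (p != null) {          // skip words outside the Vocabulary
                    score += Math.log(p);
                }
            }
            if (score > bestScore) {
                bestScore = score;
                bestCategory = category;
            }
        }
        return bestCategory;
    }
}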
Using this model I was able to classify the test documents with 87% accuracy.
Code
The code can be found online at:
http://www-personal.ksu.edu/~ejg3500/NaiveBayes.java
The training set can be downloaded from:
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html