Peer Review Version
CIS 730/732 Project
Text Classification with a Naïve Bayes Classifier
By: Esteban Guillen

Problem Statement / Objective

The problem addressed in this project was learning to classify (that is, to guess the category/group of) unlabeled text documents. The problem was solved by taking a large set of labeled text documents (documents whose category/group is known) and building a naïve Bayes classifier from them. The naïve Bayes classifier could then classify an unlabeled example based on the information learned from the labeled examples.

Significance

The project is significant because of the vast amount of information available on the internet. The ability to classify unlabeled documents would make it easier for people doing research in a specific area to find relevant material, and search engines like Google could return better results if information were better organized. The project also illustrates how a simple model like naïve Bayes can produce accurate results.

Background

This problem is well documented. The idea for this project came from Tom Mitchell's Machine Learning book. In the book, Mitchell used a collection of 20,000 documents drawn from 20 Usenet newsgroups and applied a naïve Bayes classifier to a subset of those documents, achieving about 89% classification accuracy. My goal was to achieve similar results.

Methodology

As mentioned above, I used a naïve Bayes classifier to classify unlabeled documents. I downloaded (from Mitchell's web site) the same 20,000 newsgroup documents that Mitchell used in his book. A subset of those documents was used as the training set to build the classifier, and the rest were used to test the classifier's accuracy.

The first step in building the naïve Bayes classifier was to build a vector consisting of all the relevant words found in the training documents. I defined relevant words to be those that occurred more than 5 times but fewer than 15,000 times in the training set. This eliminates both the very rare words and the very common words that occur in almost every text document; such words provide little information, so it is best to remove them. After parsing the initial training set (18,000 of the 20,000 documents), I was left with 38,500 distinct words. I will refer to this set of distinct words as the Vocabulary.

The next step was to calculate the probabilities used by the naïve Bayes classifier. First I calculated the prior probability P(v) of each category, which was 1/20 for every category. Next, for each category v, I estimated the probability of observing each word wk from the Vocabulary in a document of that category; this class-conditional probability is referred to as P(wk|v). Each document in each category was parsed, and a hash table was created for each category. The keys of the hash table were the words in the Vocabulary, and the values were the number of times each word occurred (nk) across all the documents in that category. For example, if the category was "guns" and the word "gun" occurred 3 times in each of the 900 training documents, then nk would be 2700 for that word. The total word count n (counting repeats) for each category was also calculated. From these values, P(wk|v) can be calculated with the following equation:

P(wk|v) = (nk + 1) / (n + |Vocabulary|)

To summarize the previous step: there are 20 P(v) values, each equal to 1/20, and each of the 20 categories has 38,500 P(wk|v) values.
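A minimal Java sketch of this counting and estimation step is shown below. The class and method names (NaiveBayesTraining, countWords, estimateProbabilities) are illustrative only and are not taken from the actual project code; the sketch assumes the documents for one category are already available as strings and that the Vocabulary has already been built.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the per-category training step: count word occurrences (nk),
// then apply the smoothed estimate P(wk|v) = (nk + 1) / (n + |Vocabulary|).
class NaiveBayesTraining {

    // Word counts for one category: word -> number of occurrences (nk)
    // across all training documents of that category.
    static Map<String, Integer> countWords(List<String> docs, Set<String> vocabulary) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            for (String word : doc.toLowerCase().split("\\W+")) {
                if (vocabulary.contains(word)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // P(wk|v) for every word in the Vocabulary, for one category.
    static Map<String, Double> estimateProbabilities(Map<String, Integer> counts,
                                                     Set<String> vocabulary) {
        int n = 0;  // total word count for the category (counting repeats)
        for (int c : counts.values()) {
            n += c;
        }
        Map<String, Double> probs = new HashMap<>();
        for (String word : vocabulary) {
            int nk = counts.getOrDefault(word, 0);
            probs.put(word, (nk + 1.0) / (n + vocabulary.size()));
        }
        return probs;
    }
}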
Another hash table can then be created for each category, with the 38,500 Vocabulary words as keys and each word's P(wk|v) value as the value. At this point the classifier is complete.

To classify an unlabeled example, the classifier looks up, for a given category, the probability of each word appearing in the example and multiplies those probabilities together. The category that produces the highest probability becomes the label/classification of the unlabeled example. Only the words found in the unlabeled example are looked up in the hash table. The following equation is used to classify an unlabeled example:

v = argmax_v ( P(v) Π_k P(wk|v) )

If a word in an unlabeled test example did not exist in the original Vocabulary built from the training examples, that word was ignored. Using this model I was able to classify with 87% accuracy.

Code

The code can be found online at: http://www-personal.ksu.edu/~ejg3500/NaiveBayes.java
The training set can be downloaded from: http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
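As a complement to the linked source, the following is a minimal Java sketch of the classification step described above. The names (NaiveBayesClassify, classify, categoryProbs) are illustrative and not excerpted from NaiveBayes.java. One deliberate difference from the equation above: the sketch sums log probabilities instead of multiplying raw probabilities, a common adjustment that avoids floating-point underflow when many small factors are involved and that does not change which category attains the maximum.

import java.util.Map;

// Sketch of classification: v = argmax_v ( P(v) Π_k P(wk|v) ), computed in
// log space. Words outside the Vocabulary are ignored, as described above.
class NaiveBayesClassify {

    // categoryProbs: category -> (word -> P(wk|v)); prior is P(v), 1/20 here.
    static String classify(String document,
                           Map<String, Map<String, Double>> categoryProbs,
                           double prior) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> entry : categoryProbs.entrySet()) {
            Map<String, Double> probs = entry.getValue();
            double score = Math.log(prior);
            for (String word : document.toLowerCase().split("\\W+")) {
                Double p = probs.get(word);
                if (p != null) {  // words not in the Vocabulary are skipped
                    score += Math.log(p);
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
}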