Algorithms for Spam Classification

Bridget Kelly
Kyle McCombs
University of North Carolina at Wilmington
CSC 380

Abstract

There are several text classification methods used to identify spam emails. A well-known and widely used method is the naïve Bayes classifier. This paper explores that method and the Markov model in terms of their success in correctly identifying spam emails. The success rates of these two methods, one of which assumes independence among features and one of which utilizes chaining, are compared to determine which is more effective.

1. Introduction:

A spam email is an unsolicited, usually commercial, email sent to a large number of addresses. The prevalence of spam is not a figure that is universally agreed upon; studies vary in stating that spam emails account for anywhere from 70% to 95% of all email correspondence today [3]. Machine learning text classification techniques are an accurate way to identify spam emails, and this paper explores two such methods: the naïve Bayes classifier and the Markov model.

2. Methods:

To test our classification methods, we used a spam archive found on the website http://untroubled.org/spam/. This archive consists of text files of spam emails all collected from one individual, the earliest from 1998, and it is regularly updated with the most recent spam emails. Emails from this archive were used for both the training set and the testing set of the spam classification methods explored in this paper. Our methods focused on achieving a high success rate of correctly identifying spam emails and on reducing false negative identifications. We did not focus on identifying non-spam emails or on false positive identification.

For both methods of identifying spam, our process began the same way. We developed a program that reads in the spam text files and parses through the body of each email word by word. We create a word bank that saves each word occurring in every spam email we evaluate, along with the frequency of that word's appearance across all of the emails. This information is used to associate each word with a probability that the word appears in any given spam email. These individual probabilities are used in both the naïve Bayes classifier and the Markov model.

2.1 Naïve Bayes Classifier:

The naïve Bayes classifier is one of the oldest and most commonly used methods of identifying spam emails, its first known usage being in 1996. This classifier uses the "bag of words" text identification model, which disregards grammar, word order, and punctuation, but not the multiplicity of each word. Under this model, every structure that is evaluated, be it a sentence or, in our case, an email, is seen simply as a collection of words [2]. Another important component of the naïve Bayes classifier is that it operates under an assumption of independence among the features in the model: the value of any feature is unrelated to the presence or absence of any other feature. For this application, that means each word contributes an individual probability to the email in question being spam [1].

To implement the naïve Bayes classifier for spam identification in our application, the following formula is used:

p = \frac{p_1 p_2 \cdots p_n}{p_1 p_2 \cdots p_n + (1 - p_1)(1 - p_2) \cdots (1 - p_n)}

This formula considers each word's effect on the email in question being spam. Here p represents the probability that the email in question is spam, p_1, p_2, and so on represent the probability associated with each word in the email, and n represents the total number of words within the email being evaluated. This formula does not take into consideration the probability of any given email being spam regardless of its contents (for example, the fact that roughly 85% of all emails are spam), because doing so can result in more frequent false positive identifications. Instead of the multiplication shown in the formula, our program actually calculates the classifier value using addition in the log domain to eliminate floating-point underflow.
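As an illustration, the combining step might look like the following minimal Python sketch; the function name and the assumption that per-word probabilities have been clamped away from 0 and 1 are ours, not taken from the original program:

```python
import math

def spam_probability(word_probs):
    """Combine per-word spam probabilities p1..pn into the overall
    probability p that an email is spam. The products from the
    formula are computed as sums of logarithms to avoid
    floating-point underflow on long emails."""
    # log(p1 * p2 * ... * pn); assumes each p has been clamped to an
    # interval such as [0.01, 0.99] so the logarithm is defined
    log_spam = sum(math.log(p) for p in word_probs)
    # log((1 - p1) * (1 - p2) * ... * (1 - pn))
    log_ham = sum(math.log(1.0 - p) for p in word_probs)
    # p = S / (S + H) rewritten as 1 / (1 + H/S), evaluated via the
    # difference of log-sums for numerical stability
    diff = log_ham - log_spam
    if diff > 700.0:  # exp() would overflow; email is overwhelmingly non-spam
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))
```

The value this returns is the classifier value that the thresholding step described next operates on.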
After calculating the classifier value from the formula above, it is compared to a threshold to determine whether an email is classified as spam.

When implementing our naïve Bayes classifier, we evaluated 15,000 emails in our training phase. Our preliminary threshold was the average classifier value of those 15,000 training emails. This seemed like a reasonable option for the threshold because it is the mean naïve Bayes classifier value of 15,000 emails confirmed to be spam. Using this threshold, we had a 2.86% success rate (86 of 3,000 emails correctly identified) in identifying spam emails during our testing phase. After running statistical analysis on the 15,000 naïve Bayes values from the learning set, we found that those values are not normally distributed. This is irrelevant to the effectiveness of our classifier, but the mean is often a poor representation of data that is not normally distributed, which led us to explore other options for the threshold in order to raise our success rate. Using the median of the 15,000 classifier values as the threshold for the 3,000 emails in our testing set, we had a success rate of 52.03%, or 1,561 of 3,000 emails correctly identified.

To improve the model further, only the fifteen words with the highest associated spam probabilities were included when determining the overall spam probability of a given email. This is done because no matter how many neutral words an email contains (words associated with a low spam probability), they should not dilute the statistical significance of highly incriminating spam words. Using this model, we had a success rate of 97.13%, or 2,914 of 3,000 emails correctly identified as spam. The complexity of running the naïve Bayes classifier on any given email is O(n), where n is the number of words in the email being evaluated.

2.2 Markov Model:

The second approach we implemented to construct a spam-email classifying algorithm involved a Markov model. Markov models are used for modeling randomly changing systems in which it is assumed that the future state depends only on the present state and not on the sequence of events that preceded it. Implementation began with obtaining a training set of emails, all confidently classified as spam. The training set we used contained 3,000 spam emails, all written as text files. Parsing through every pair of adjacent words in the entire training set, we calculated and stored the average occurrence rate at which word 'B' followed word 'A' across all spam emails in the training set, for every pair of adjacent words except those in which one of the words contained a period, exclamation mark, or question mark. The presence of such punctuation implies that the two words being examined are not directly related, so the occurrence rate between them is irrelevant.
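A minimal Python sketch of this counting step follows; the names are hypothetical, and we read the "average occurrence rate" as the fraction of a word's counted adjacent pairings in which a given word follows it, which is our assumption rather than a detail stated in the original program:

```python
from collections import defaultdict

SENTENCE_END = ('.', '!', '?')

def pair_occurrence_rates(emails):
    """For each word A in the spam training set, record how often each
    word B follows it, skipping pairs in which either word contains
    sentence-ending punctuation, then convert counts to rates."""
    follows = defaultdict(lambda: defaultdict(int))  # follows[A][B] = count
    totals = defaultdict(int)                        # times A begins a counted pair
    for body in emails:                              # each email body as one string
        words = body.split()
        for a, b in zip(words, words[1:]):
            # Pairs spanning a sentence boundary are not directly
            # related and contribute nothing to the model.
            if any(mark in a or mark in b for mark in SENTENCE_END):
                continue
            follows[a][b] += 1
            totals[a] += 1
    # Rate for the edge A -> B: the fraction of A's counted pairings
    # in which B was the following word.
    return {a: {b: count / totals[a] for b, count in next_words.items()}
            for a, next_words in follows.items()}
```

The nested dictionary this produces plays the role of the weighted directed graph described next: outer keys are source vertices, inner keys are destination vertices, and the values are the edge weights.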
Once the average occurrence rate of each word following another has been calculated for every word in the training set, we create a weighted directed graph to store these values. The vertices of the graph are the distinct words found in the training set. Each vertex has edges to other vertices, which represent the corresponding adjacent words, and each edge carries a weight corresponding to the average occurrence rate between its two vertices.

Once all of the occurrence rates are stored, we can classify an email. To classify an email in question, we parse through the email examining adjacent words and look up the average occurrence rate for each of these pairs. If an edge is not found, the two words in question never appeared beside each other in the training set, and the pair makes no contribution to the total grade of the email. Averaging these occurrence rates gives the final grade of the email in question. To determine a threshold for declaring an email spam or non-spam, we ran the above algorithm on the training set and took the average grade of all of the spam emails examined. To test our algorithm, we obtained another set of 1,000 spam emails. We then graded each email in the test set against the determined threshold: if the grade of an email was lower than the threshold, we declared it non-spam; if the grade was above the threshold, we declared it spam. The final results revealed that our algorithm had a 92% success rate at classifying spam emails; out of 1,000 spam emails in our test set, 920 were classified as spam by our algorithm.

3. Conclusion:

The naïve Bayes classifier and the Markov model differ in several ways, but the most important for our application is that naïve Bayes assumes independence among features, while the Markov model uses chaining to evaluate a word's probability of occurring in association with the word that precedes it. The Markov model is more difficult to implement than the comparatively simple naïve Bayes classifier and has a greater time complexity when running. After completing testing of both classification models, the naïve Bayes classifier had a higher success rate than the Markov model. The simple naïve Bayes model is still heavily used today to identify spam emails because of its simplicity and high success rate.

4. References:

[1] Jiang, Liangxiao, Harry Zhang, and Zhihua Cai. "A Novel Bayes Model: Hidden Naive Bayes." IEEE Transactions on Knowledge and Data Engineering 21.10 (2009). IEEE Xplore Digital Library. Web. 1 Nov. 2014. <http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4721435&abstractAccess=no&userType=inst>.

[2] Almeida, Tiago A., Akebo Yamakami, and Jurandy Almeida. "Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters." 2009 International Conference on Machine Learning and Applications (2009). IEEE Xplore Digital Library. Web. 1 Nov. 2014. <http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5381433&abstractAccess=no&userType=inst>.

[3] Graham, Paul. "Better Bayesian Filtering." Web. <http://www.paulgraham.com/better.html>.