Bayesian Spam Filtering

By Rachael Bornstein and Daniel Miao
November 30, 2011
Math 20: Discrete Probability
Professor Zach Hamaker

Despite communication technology's continual advances, electronic mail (e-mail) has remained an ever-popular method of communication since its invention in the 1970s. E-mail is fast, cheap, and very easy to use. For these reasons, e-mail remains extremely popular as not only an outlet for "friends and colleagues to exchange messages, but also a medium for conducting electronic commerce."1 People love e-mail for its simplicity and cost effectiveness, which unfortunately leads spammers to love it for the exact same reasons. Spammers are businessmen who blindly send unsolicited bulk messages to large numbers of people as a form of advertising. Spammers keep sending spam because, although the response rate to these e-mails is extremely low ("at best 15 per million vs 3000 per million for a catalog mailing"), the cost to the spammer is essentially nothing.2 According to Symantec's "State of Spam: A Monthly Report," "overall spam volumes averaged 87 percent of all email messages in August 2009."3 To save e-mail users from wasting time manually deleting spam messages, mathematical spam-filtering methods have been created to separate spam from legitimate e-mail. Modern spam filtering uses probabilistic methods to differentiate between spam and non-spam e-mails. The underlying principle behind spam filtering is called the Naïve Bayes classifier.
In this method of categorizing spam, an e-mail is represented by an array x = (x1, x2, x3, …, xn), where x1, …, xn are the values of characteristics X1, …, Xn (in this case, the presence of specific words, e.g., "Viagra").4 Let Xi = 1 if this particular characteristic exists in the e-mail message, and let Xi = 0 if characteristic Xi is not present.4 To find which words, out of all possible characteristics, are the most significant in detecting spam, a quantity called "mutual information" (MI) is computed for each candidate word X:4

MI(X; C) = \sum_{x \in \{0,1\},\, c \in \{spam,\, legit\}} P(X = x, C = c) \cdot \log \frac{P(X = x, C = c)}{P(X = x)\, P(C = c)},

where C is a variable that denotes a category (i.e., either spam or legitimate).4 Essentially, this equation measures how strongly a word's presence or absence predicts whether an e-mail is spam, based on frequencies of appearance in a training set.4 A training set is a set of e-mails (several thousand are needed in order to produce accurate results) that are pre-labeled as spam and non-spam. The spam filter works by calculating the MI for the words in every message in both categories of e-mail, and in doing so creates spam probabilities that are specifically tailored to the individual user. The words with the highest MI are selected as the characteristics used to detect the likelihood that an e-mail is spam.4 Once the key spam-characterizing words have been identified, Bayes' theorem can be used to find the probability that an e-mail belongs to a category c, spam or non-spam.4 Since Bayes' theorem is being used naïvely, it must be assumed that the appearances of particular words in a message are independent events. Note, however, that in actuality some words are more likely to be found together (e.g., "free" and "Viagra" are more likely to be found together than "Dartmouth" and "Viagra").
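The MI selection step above can be sketched in a few lines of Python. The toy training set, its words, and the labels below are invented for illustration; a real filter would estimate these frequencies from several thousand messages.

```python
import math

# Toy pre-labeled training set: (set of words in the message, category).
# These four messages are made-up examples, far too small for real use.
training = [
    ({"viagra", "free", "money"}, "spam"),
    ({"free", "offer", "viagra"}, "spam"),
    ({"meeting", "tomorrow", "notes"}, "legit"),
    ({"free", "lunch", "meeting"}, "legit"),
]

def mutual_information(word, messages):
    """MI(X; C) = sum over x in {0,1} and c in {spam, legit} of
       P(X=x, C=c) * log( P(X=x, C=c) / (P(X=x) * P(C=c)) )."""
    n = len(messages)
    mi = 0.0
    for x in (0, 1):
        for c in ("spam", "legit"):
            joint = sum(1 for words, label in messages
                        if (word in words) == bool(x) and label == c) / n
            px = sum(1 for words, _ in messages if (word in words) == bool(x)) / n
            pc = sum(1 for _, label in messages if label == c) / n
            if joint > 0:  # 0 * log(0) is treated as 0
                mi += joint * math.log(joint / (px * pc))
    return mi

# Rank candidate words: the highest-MI words become the classifier's features.
vocab = set().union(*(words for words, _ in training))
ranked = sorted(vocab, key=lambda w: mutual_information(w, training), reverse=True)
print(ranked[:3])
```

With these counts, "viagra" (appearing only in spam) scores the maximum possible MI, while "free" (appearing in both categories) scores much lower, which is exactly why MI picks out discriminative words.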
However, this independence assumption is justifiable because "several studies have found the Naive Bayesian classifier to be surprisingly effective (Langley et al., 1992; Domingos & Pazzani, 1996), despite the fact that its independence assumption is usually over-simplistic."4 Bayes' theorem gives the following equation for the probability that an e-mail belongs to category c, given the array x = (x1, x2, x3, …, xn):4

P(C = c \mid X = x) = \frac{P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)}{P(X = x)},

where P(Xi | C) (e.g., the probability that an e-mail contains a particular word given that the e-mail is spam) and P(C) (e.g., the probability that an e-mail is spam) are relative frequencies estimated from the training set.4 In words, this equation says that the probability a message is spam given its words equals the prior probability of spam, times the probability of seeing each of those words in a spam message, divided by the overall probability of seeing that combination of words. Because the words are assumed independent, these per-word probabilities are simply multiplied across the entire array x, as seen in the equation above. The final step in Naïve Bayesian classification is determining a threshold for classification. Mistaking a legitimate e-mail for spam is a much more severe problem than letting spam through the filter by mistakenly classifying it as legitimate.4 By assuming that the former error is λ times more costly than the latter, an e-mail can be sorted into the spam category if:4

\frac{P(C = spam \mid X = x)}{P(C = legit \mid X = x)} > \lambda.

This equation says that, given an e-mail with the word array x, the ratio of the probability that it is spam to the probability that it is legitimate must exceed λ for the e-mail to be placed into the spam folder.
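The classification rule above can be sketched as follows. The priors and per-word probabilities here are hypothetical numbers standing in for frequencies estimated from a training set; note that P(x) cancels when taking the spam-to-legitimate ratio, so it never needs to be computed.

```python
import math

# Hypothetical relative frequencies, as if estimated from a training set.
priors = {"spam": 0.5, "legit": 0.5}          # P(C = c)
p_word_given = {                               # P(word present | C = c)
    "spam":  {"viagra": 0.60, "free": 0.70, "meeting": 0.05},
    "legit": {"viagra": 0.01, "free": 0.30, "meeting": 0.40},
}
FEATURES = ["viagra", "free", "meeting"]
LAMBDA = 9  # assume a false positive is 9x as costly as a missed spam

def log_score(category, present_words):
    # log P(C=c) + sum_i log P(X_i = x_i | C=c), using independence
    total = math.log(priors[category])
    for w in FEATURES:
        p = p_word_given[category][w]
        total += math.log(p if w in present_words else 1.0 - p)
    return total

def classify(present_words):
    # spam iff P(spam | x) / P(legit | x) > lambda; P(x) cancels in the ratio
    log_ratio = log_score("spam", present_words) - log_score("legit", present_words)
    return "spam" if log_ratio > math.log(LAMBDA) else "legit"

print(classify({"viagra", "free"}))  # spam: the ratio far exceeds lambda
print(classify({"meeting"}))         # legit: the ratio falls well below lambda
```

Working in log space is the standard trick here: multiplying many small probabilities underflows floating point, while summing their logarithms does not.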
For example, the threshold λ = 999 could be set to say that classifying one legitimate e-mail as spam is as bad as having 999 spam e-mails clutter one's inbox.4 While this threshold may seem very high, it is a sensible value when there is no further processing of e-mails categorized as spam, since most people would agree that blocking a legitimate e-mail is not acceptable.4 The first viable spam-filtering system built upon the principles of the Naïve Bayesian classifier was created and discussed in the 1998 scholarly paper "A Bayesian Approach to Filtering Junk E-Mail" by Sahami et al.1 Sahami's work was revolutionary for spam filtering in the sense that, prior to his work, spam filtering was rule-based and inadaptable.4 Inadaptable spam filtering is problematic because spam is ever-changing, and spammers constantly find new ways to try to beat the system.1 According to Sahami, the rule-based filtering method has limited utility because filtering rule sets generally make rigid binary decisions about whether an e-mail should be categorized as spam.1 For example, in a rule-based system, an e-mail that contains the word "Viagra" might be automatically sent to the spam folder, whereas the Bayesian filter might mark this message as probable spam and then take into account other factors that could outweigh the word "Viagra" and indicate a legitimate e-mail.5 Because misclassifying a legitimate e-mail is more costly than misclassifying a spam e-mail, Sahami created a "utility model" based on Bayesian classification to make sure this "difference in cost between the two types of errors" is taken into account.1 In Sahami's method, each user personally helps create the spam probabilities when sorting the training set, and over time these probabilities can adapt to changing spam through the user's flagging of wrongly sorted messages.
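The contrast between a rigid rule and probabilistic weighing can be made concrete with a small sketch. Both filters and their rules/weights below are hypothetical toys, not Sahami's actual system; the point is only that a single suspicious word need not dominate when other evidence pushes the other way.

```python
def rule_based(features):
    # One hard rule: any message containing "viagra" is spam, full stop.
    return "spam" if "viagra" in features else "legit"

def bayesian_like(features):
    # Toy log-odds tally: each feature shifts the spam score up or down,
    # so strong legitimate evidence can outweigh one suspicious word.
    weights = {"viagra": 3.0, "edu_sender": -4.0, "personal_greeting": -2.0}
    score = sum(weights[f] for f in features if f in weights)
    return "spam" if score > 0 else "legit"

# A colleague at a university mentions Viagra in an otherwise personal note:
msg = {"viagra", "edu_sender", "personal_greeting"}
print(rule_based(msg), bayesian_like(msg))  # the two filters disagree
```

The rule-based filter condemns the message outright, while the weighted filter lets the ".edu" sender and personal greeting outweigh the trigger word, which is the adaptive behavior the paragraph above describes.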
Since each individual user has the ability to manually flag the spam e-mails he receives in his inbox, the MIs used to classify e-mails as spam constantly readjust and are personal to the individual. Sahami's Bayesian filtering approach made major improvements in spam filtering by being the first adaptable and user-specific spam-filtering method.1 As an extension of the Naïve Bayesian classification method described above, Sahami considered supplementary features of e-mails, called "domain-specific properties," in addition to the words of the message, to classify e-mails more accurately.1 Domain-specific properties include "particular phrases, such as 'Free Money,' or over-emphasized punctuation, such as '!!!!,' [that] are indicative of junk E-mail."1 The domain of the sender's e-mail address also provides insight as to whether or not an e-mail could be spam.1 For instance, e-mail addresses with the ".edu" domain essentially never send spam.1 To incorporate these additional characteristics into the Bayesian classification model, the domain-specific properties can simply be added as additional variables in the e-mail's characteristic array.1 Thus the presence of particular domain-specific properties can be "uniformly incorporated into the classification models and the learning algorithms employed need not be modified."1 Sahami used an array of 500 features, or tokens, including key words, phrases, and domain-specific properties, based on the training set, to build his classifier.1 Note that Sahami did not run a large number of experiments to arrive at 500 as the best number of characteristics to use when filtering; rather, he found that this value delivered reliable results after initial experiments.1 In his experiments, Sahami used a medium-sized training set of 1,538 e-mails and a testing set of 251 e-mails.1 Sahami's Bayesian classifier was set up to consider word-based characteristics, as well as 35 phrasal features and 20 non-textual domain-specific
properties to detect spam.1 To incorporate cost-sensitive error detection, a threshold of 99.9% or greater probability of being spam was used to categorize an e-mail as spam.1 As a result, Sahami found that the percentage of messages classified as spam that truly were spam (the spam precision) was 97.1% when using words only, 97.6% when using words and phrases, and close to 100% when using words, phrases, and domain-specific features.1 In this experiment, the legitimate-mail precision was 87.7% when using words only, 87.8% when using words and phrases, and 96.2% when using words, phrases, and domain-specific features.1 Thus, incorporating characteristics like domain-specific properties results in far more accurate classification than using words alone.1 While Sahami provided a much-improved model for spam filtering by using Bayesian classification, the number of false positives (legitimate e-mails mistakenly categorized as spam) was still too high to be satisfactory.1 In 2002, Paul Graham made additional improvements to Sahami's methods that greatly reduced the false-positive rate.5 According to Graham, his model's filtering can "now miss less than 5 per 1000 spams, with 0 false positives."2 Graham essentially created a much more complex filter that could detect many more features; however, his filter is still based on the same mathematical principles of Bayesian filtering.6 Graham realized that headers can tell a lot about the likelihood of an e-mail being spam, and thus they cannot be ignored.6 Graham also took into account where particular words are located within the message.6 Other parts of an e-mail Graham used as spam indicators include attached URLs and the use of colored fonts.2 By combining these factors, as well as others, Graham created far more recognizable tokens (words/features). Instead of the 500 or so tokens that Sahami trained his filters to recognize, Graham utilized tens of thousands.
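The idea of folding word tokens, phrases, and non-textual domain-specific properties into one uniform characteristic array can be sketched as follows. The specific token lists and properties below are illustrative assumptions, not Sahami's or Graham's actual feature sets.

```python
import re

# Illustrative token lists; real systems learn hundreds (Sahami) to tens of
# thousands (Graham) of tokens from a training set.
WORD_TOKENS = ["viagra", "free", "money"]
PHRASE_TOKENS = ["free money", "act now"]

def domain_properties(sender, body):
    """Non-textual, domain-specific binary properties of a message."""
    return {
        "over_punctuation": "!!!" in body,            # e.g. "!!!!"
        "edu_sender": sender.endswith(".edu"),        # .edu senders rarely spam
        "has_url": bool(re.search(r"https?://", body)),
    }

def feature_vector(sender, body):
    """One binary array x = (x1, ..., xn) mixing words, phrases, and
    domain-specific properties; the classifier treats every Xi uniformly."""
    text = body.lower()
    words_in_text = set(re.findall(r"[a-z]+", text))
    features = {w: (w in words_in_text) for w in WORD_TOKENS}
    features.update({p: (p in text) for p in PHRASE_TOKENS})
    features.update(domain_properties(sender, body))
    return features

x = feature_vector("promo@example.com",
                   "FREE MONEY!!!! Act now: http://spam.example")
print(x)
```

Because every entry is just another binary Xi, the Naïve Bayes machinery needs no modification to use the extra properties, which is exactly the "uniform incorporation" point quoted above.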
Graham showed that ever more precise tokenization and token recognition lead to higher levels of spam detection while lowering the rate of false positives. However, this also leads to a significantly higher risk that the user will not bother to check his spam folder for false positives (which is problematic because it is almost inevitable that a legitimate e-mail will one day end up in the spam box). While spammers continually alter their methods to find ways to beat spam filters, the Naïve Bayesian classification approach has so far proved to be both adaptable and sufficient. The use of several different characteristics simultaneously, beyond just words, to determine the probability that an e-mail is spam has significantly reduced false-positive rates. However, because mistakenly classifying legitimate e-mails as spam is so highly unacceptable, Graham and others are still working to eliminate this problem completely in spam filters of the future. One possible way to address the continual problem of false positives is to find a way for future spam filters to recognize non-spam features, an idea Graham has been exploring, though he has not yet come up with a method to do this.2 While spam filtering today may not be perfect, modern filters based upon the Naïve Bayesian method have been accurate enough that our inboxes are not filled with spam and our important e-mails rarely get shuffled into the spam box. The very fact that most people use e-mail and trust their filters shows that these methods based on the Naïve Bayesian classifier truly work. These filters have kept e-mail from becoming a relic of the past.

References

1. Sahami, M. 1998. A Bayesian Approach to Filtering Junk E-Mail. Microsoft Research. [Online]. http://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf.
2. Graham, P. 2002. A Plan for Spam. [Online]. http://www.paulgraham.com/spam.html.
3.
State of Spam: A Monthly Report. Symantec. [Online]. 2009 [cited 2011 Nov. 13]. http://sup.kathimerini.gr/xtra/media/files/var/spam_909.pdf.
4. Androutsopoulos, I. 2000. An Evaluation of Naïve Bayesian Anti-Spam Filtering. arXiv. [Online]. [cited 2011 Nov. 13]. 9-17. http://arxiv.org/pdf/cs/0006013v1.
5. Bayesian Spam Filtering. Wikipedia. [Online]. http://en.wikipedia.org/wiki/Bayesian_spam_filtering.
6. Graham, P. 2003. Better Bayesian Filtering. [Online]. http://www.paulgraham.com/better.html.