Bayesian Spam Filtering

By Rachael Bornstein and Daniel Miao
November 30, 2011
Math 20: Discrete Probability
Professor Zach Hamaker
Despite communication technology’s continual advances, electronic mail (e-mail) has
remained an ever-popular method of communication since its invention in the 1970s. E-mail
is fast, cheap, and very easy to use. For these reasons, e-mail remains extremely popular as
not only an outlet for “friends and colleagues to exchange messages, but also a medium for
conducting electronic commerce.”1 People love e-mail for its simplicity and cost-effectiveness,
which unfortunately leads spammers to love it for the exact same reasons.
Spammers are businessmen who blindly send unsolicited bulk messages to large numbers of
people as a form of advertising. Spammers keep sending spam because although the response
rate to these emails is extremely low—“at best 15 per million vs 3000 per million for a
catalog mailing”—the cost is essentially nothing for the spammer.2 According to Symantec’s
“State of Spam: A Monthly Report” in August 2009, “overall spam volumes averaged 87
percent of all email messages in August 2009.”3 To save e-mail users from wasting time
going through and manually deleting spam messages, mathematical spam-filtering methods
have been created to separate spam from legitimate e-mails.
Modern spam-filtering uses probabilistic methods as a means of differentiating
between spam and non-spam emails. The underlying principle behind spam-filtering is called
the Naïve Bayes Classifier. In this method of categorizing spam, an e-mail is represented by an
array x = (x1, x2, x3, …, xn), where x1, …, xn are the values of characteristics (in this case, specific
words, e.g., “Viagra”) X1, …, Xn.4 Let Xi = 1 if this particular characteristic exists in the e-mail
message, and let Xi = 0 if characteristic Xi is not present.4 To find which words, out of all possible
characteristics, are the most significant in detecting spam, a variable called “mutual
information (MI)” is determined for each candidate word X.4 The following equation is used
to calculate MI:

MI(X; C) = \sum_{x \in \{0,1\},\, c \in \{\text{spam},\, \text{legit}\}} P(X = x, C = c) \cdot \log \frac{P(X = x, C = c)}{P(X = x)\, P(C = c)},
where C is a variable that denotes the category (e.g., either spam or legitimate).4 Essentially,
this equation measures how much the presence or absence of a word reveals about whether an
e-mail is spam, based on frequencies of appearance in a training set.4 A training set is a set of
e-mails (several thousand are needed in order to produce accurate results) that are pre-labeled as spam and
non-spam. The spam filter works by calculating the MIs for the words in every message
assigned to both categories of e-mail, and in doing so creates spam probabilities that are
specifically tailored to the individual user. The words with the highest MI are selected as the
characteristics used to detect the likelihood that an e-mail is spam.4
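The MI computation described above can be sketched in a few lines. This is an illustrative implementation, not the authors’ code, and the tiny four-message corpus is invented for demonstration: a word that perfectly separates the two categories scores MI = log 2, while a less discriminative word scores lower.

```python
import math
from collections import Counter

def mutual_information(word, emails, labels):
    """MI(X; C) for one candidate word X over a labeled training set.

    emails: list of sets of lowercase words, one set per message.
    labels: parallel list of category labels, "spam" or "legit".
    """
    n = len(emails)
    counts = Counter()  # (x, c) -> number of messages with X = x in category c
    for words, c in zip(emails, labels):
        counts[(1 if word in words else 0, c)] += 1
    mi = 0.0
    for x in (0, 1):
        for c in ("spam", "legit"):
            p_xc = counts[(x, c)] / n               # P(X = x, C = c)
            if p_xc == 0:
                continue                            # a zero term contributes 0
            p_x = (counts[(x, "spam")] + counts[(x, "legit")]) / n
            p_c = labels.count(c) / n
            mi += p_xc * math.log(p_xc / (p_x * p_c))
    return mi

# Toy training set: "viagra" appears in exactly the spam messages.
emails = [{"viagra", "free"}, {"meeting", "notes"},
          {"viagra", "pills"}, {"lunch", "meeting"}]
labels = ["spam", "legit", "spam", "legit"]
print(mutual_information("viagra", emails, labels))  # ln 2, about 0.693
print(mutual_information("lunch", emails, labels))   # lower, about 0.216
```

A real filter would run this over every word in a training set of several thousand messages and keep the top scorers as the characteristics X1, …, Xn.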
Once the key spam characterizing words have been identified, Bayes’ theorem can be
used to find the probability that an e-mail belongs to a category c, spam or non-spam.4 Since
Bayes’ theorem is being used, it must be assumed that the appearances of particular words in a
message are independent events. Note, however, that in actuality some words are more likely
to be found together (e.g. “free” and “Viagra” are more likely to be found together than
“Dartmouth” and “Viagra”). However, this assumption is justifiable because, “several
studies have found the Naive Bayesian classifier to be surprisingly effective (Langley et al.,
1992; Domingos&Pazzani, 1996), despite the fact that its independence assumption is usually
over-simplistic.”4
Bayes’ theorem gives the following equation to find the probability that an e-mail is
spam (i.e., in category c), given the array x = (x1, x2, x3, …, xn):4

P(C = c \mid \vec{X} = \vec{x}) = \frac{P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)}{\sum_{k \in \{\text{spam},\, \text{legit}\}} P(C = k) \prod_{i=1}^{n} P(X_i = x_i \mid C = k)},
where P(Xi | C) (i.e., the probability that an e-mail contains a particular word given that the e-mail
is spam) and P(C) (i.e., the probability that the e-mail is spam) are relative frequencies that are
estimated using the training set.4 In words, this equation says, for one word:

P(\text{spam} \mid \text{word}) = \frac{P(\text{word} \mid \text{spam}) \cdot P(\text{spam})}{P(\text{word})}.
Therefore, for the entire e-mail, which contains the array x of words, these per-word
probabilities are multiplied together, as seen in the full equation above.
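This posterior computation can be sketched directly. The probability tables below are made-up illustrative values, not estimates from a real training set; in practice the priors and conditionals would be relative frequencies from the labeled messages.

```python
def posterior_spam(x, priors, cond):
    """P(C = spam | X = x) for a naive Bayes classifier.

    x:      dict mapping word -> 0/1 (the feature vector of one e-mail)
    priors: {"spam": P(spam), "legit": P(legit)}
    cond:   cond[c][word] = P(X_word = 1 | C = c), estimated from training data
    """
    joint = {}
    for c in priors:
        p = priors[c]                      # start from the prior P(C = c)
        for word, xi in x.items():         # multiply in P(X_i = x_i | C = c)
            p_word = cond[c][word]
            p *= p_word if xi == 1 else (1.0 - p_word)
        joint[c] = p
    # The denominator P(X = x) is the sum of the joint probabilities.
    return joint["spam"] / (joint["spam"] + joint["legit"])

priors = {"spam": 0.5, "legit": 0.5}
cond = {"spam":  {"viagra": 0.80, "meeting": 0.10},
        "legit": {"viagra": 0.01, "meeting": 0.50}}
x = {"viagra": 1, "meeting": 0}
print(posterior_spam(x, priors, cond))  # about 0.993
```

With these numbers, seeing “viagra” and no “meeting” pushes the posterior to roughly 0.993, even though the prior was an even 50/50.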
The final step to Naïve Bayesian classification is determining a threshold for
classification. Mistaking a legitimate e-mail for spam is a much more severe problem than
letting spam get through the filter by being mistakenly classified as legitimate.4 By assuming
that the former error is λ times more costly than the latter, an e-mail can be sorted into the
spam category if:4

\frac{P(C = \text{spam} \mid \vec{X} = \vec{x})}{P(C = \text{legit} \mid \vec{X} = \vec{x})} > \lambda.
This equation says that, given an e-mail with the array x of words, the ratio of the probability that
it is spam to the probability that it is legitimate must be greater than λ to be placed into the
spam folder. For example, the threshold of λ = 999 could be set to say that classifying a
legitimate e-mail as spam is as bad as having 999 spam e-mails clutter one’s inbox.4 While
this threshold may seem very high, it is a sensible value when there is no further processing of
e-mails categorized as spam, since most people would agree that blocking a legitimate e-mail is
not acceptable.4
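Because the two posteriors sum to one, the ratio test simplifies: P(spam | x) / P(legit | x) > λ is equivalent to P(spam | x) > λ / (1 + λ), which for λ = 999 is a 0.999 cutoff on the posterior. A minimal sketch of the decision rule:

```python
def classify(p_spam, lam=999.0):
    """Cost-sensitive decision rule for a posterior P(spam | x).

    Since P(legit | x) = 1 - P(spam | x), the ratio test
    P(spam | x) / P(legit | x) > lam is equivalent to the threshold
    P(spam | x) > lam / (1 + lam)   (0.999 when lam = 999).
    """
    return "spam" if p_spam > lam / (1.0 + lam) else "legit"

print(classify(0.9995))          # "spam":  the ratio is 1999, above 999
print(classify(0.9980))          # "legit": the ratio is 499, below 999
print(classify(0.60, lam=1.0))   # "spam":  lam = 1 is a plain majority vote
```

Setting λ = 1 recovers the cost-neutral classifier; raising λ trades missed spam for fewer blocked legitimate messages.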
The first viable spam filtering system built upon the principles of the Naïve Bayesian
Classifier was created and discussed in the 1998 scholarly paper, “A Bayesian Approach to
Filtering Junk E-Mail” by Sahami et al.1 Sahami’s work was revolutionary for spam filtering,
in the sense that prior to his work spam filtering was rule-based and inadaptable.4 Inadaptable
spam filtering is problematic because spam is ever-changing, and spammers constantly find
new ways to try to beat the system.1 According to Sahami, the rule-based filtering method has
limited utility because filtering rule sets generally make rigid binary decisions
when deciding whether an e-mail should be categorized as spam.1 For example, in a rule-based
system, an e-mail that contains the word “Viagra” might be automatically sent to the spam
folder whereas the Bayesian filter might mark this message as probable spam and then take
into account other factors that could outweigh the word “Viagra” and indicate a legitimate
e-mail.5 Because misclassification of a legitimate e-mail is more costly than misclassifying a
spam e-mail, Sahami created the “utility model” based on Bayesian classification to make sure
this “difference in cost between the two types of errors” is taken into account.1 In Sahami’s
method, each user will personally help create the probable spam probabilities when sorting
the training set, and then over time, these probabilities can be adapted to changing spam by
the user’s indication of wrongly sorted messages. Since each individual user can
manually flag the spam e-mails he receives in his inbox, the MIs used to classify e-mails as spam
constantly readjust and are personal to the individual. Sahami’s Bayesian filtering approach
made major improvements in spam filtering by being the first adaptable and user-specific
spam filtering method.1
As an extension of the Naïve Bayesian Classifying method described above, Sahami
considered supplementary features of e-mails, called “domain-specific properties,” in
addition to the words of the message, to classify e-mails more accurately.1 Domain-specific
properties include, “particular phrases, such as “Free Money,” or over-emphasized
punctuation, such as “!!!!,” [that] are indicative of junk E-mail.”1 The domain of the sender’s
e-mail address also provides insight as to whether or not an e-mail could be spam.1 For
instance, “.edu” domain e-mail addresses essentially never send spam e-mails.1 To
incorporate these additional characteristics into the Bayesian Classification model, the
domain-specific properties can simply be added as additional variables in the characteristic
array for the e-mail.1 Thus, detection of the presence of particular domain-specific properties
can be “uniformly incorporated into the classification models and the learning algorithms
employed need not be modified.”1
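The uniform treatment described above can be sketched as follows. The specific property names and detection rules here are invented for illustration (Sahami et al. used 35 phrasal and 20 non-textual features of this general kind); the point is that each property is just one more 0/1 variable in the same array as the word features.

```python
import re

def feature_vector(text, sender, vocabulary):
    """Word features plus domain-specific properties as extra binary variables.

    The property names and rules below are hypothetical examples; any
    detectable trait of a message can be encoded the same way, so the
    classifier and learning algorithm need no modification.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    x = {w: (1 if w in words else 0) for w in vocabulary}
    # Domain-specific properties, in the same 0/1 form as the word features:
    x["PHRASE_free_money"] = 1 if "free money" in text.lower() else 0
    x["PUNCT_multibang"] = 1 if "!!!!" in text else 0
    x["SENDER_edu"] = 1 if sender.lower().endswith(".edu") else 0
    return x

x = feature_vector("FREE MONEY waiting for you!!!!", "promo@deals.example",
                   ["free", "money", "meeting"])
print(x["PHRASE_free_money"], x["PUNCT_multibang"], x["SENDER_edu"])  # 1 1 0
```

The resulting dictionary feeds straight into the same naive Bayes posterior as a pure word vector would.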
Sahami used an array of 500 features, or tokens, including key words, phrases, and
domain-specific properties, based on the training set, to build his classifier.1 Note that Sahami
did not run a large number of experiments to arrive at 500 as the best number of
characteristics to use when filtering, but he found that this value delivered reliable results
after initial experiments.1 In experimenting, Sahami used a medium-sized training set of 1538
e-mails and a testing set of 251 e-mails.1 Sahami’s Bayesian Classifier was set up to consider
word-based characteristics, as well as 35 phrasal features and 20 non-textual domain-specific
properties, to detect spam.1 To incorporate cost-sensitive error detection, a threshold of 99.9%
or greater probability of being spam was used to categorize an e-mail as spam.1 As a result,
Sahami found that the spam precision (the percentage of messages classified as spam that truly
were spam) was 97.1% when using words only, 97.6% when using words and phrases, and
close to 100% when using words, phrases, and domain-specific features.1 In this experiment,
the legitimate mail precision was 87.7% when using words only, 87.8% when using words and
phrases, and 96.2% when using words, phrases, and domain-specific features.1 Thus, the
incorporation of characteristics like domain-specific properties results in far more accurate
classification compared to using words alone.1
While Sahami provided a much improved model for spam filtering by using Bayesian
Classification, the number of false-positives (legitimate e-mails mistakenly categorized as
spam) was still too high to be satisfactory.1 In 2002, Paul Graham came up with additional
improvements to Sahami’s methods to greatly reduce the false-positive rate.5 According to
Graham’s model, spam filtering can “now miss less than 5 per 1000 spams, with 0 false
positives.”2 Graham essentially created a much more complex filter that could detect many
more features; however, his filter is still based on the same mathematical principles of
Bayesian filtering.6 Graham realized that headers can tell a lot about the likelihood of an
e-mail being spam, and thus they cannot be ignored.6 Graham also took into account where
particular words are located within the message.6 Other parts of an e-mail Graham used as
spam indicators include attachments, URLs, and the use of colored fonts.2 By combining
these factors, as well as others, Graham created far more recognizable tokens
(words/features). Instead of the 500 or so tokens that Sahami trained his filters to recognize,
Graham utilized tens of thousands. He showed that ever more precise tokenization and token
recognition lead to higher levels of spam detection while lowering false-positive rates.
However, this also leads to a significantly higher risk that the user will not bother to check
his spam folder for false positives (which is problematic because it is almost inevitable that a
legitimate email will one day end up in the spam box).
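Graham’s location-aware tokenization can be sketched as below. Prefixing header tokens with the field name (Graham writes tokens like “Subject*free”) lets a word in a subject line accumulate spam statistics separately from the same word in the body; the exact token regex here is an assumption for illustration.

```python
import re

TOKEN = re.compile(r"[\w$'!.-]+")  # keep $, !, apostrophes, dots, dashes

def graham_tokens(headers, body):
    """Location-aware tokenization in the spirit of Graham's 2003 filter.

    headers: dict of header field name -> field value.
    body:    the message text.
    Header tokens are prefixed with their field name, so 'free' in a
    Subject line ('subject*free') is scored independently of 'free'
    appearing in the body.
    """
    tokens = []
    for field, value in headers.items():
        for tok in TOKEN.findall(value.lower()):
            tokens.append(field.lower() + "*" + tok)
    tokens.extend(TOKEN.findall(body.lower()))
    return tokens

toks = graham_tokens({"Subject": "FREE money now"}, "Click for free money")
print(toks)
```

Every distinct location-tagged token becomes its own feature, which is how a vocabulary of 500 hand-chosen characteristics grows into tens of thousands of automatically collected ones.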
While spammers are continually altering their methods to try and find ways to beat
the spam filtering system, so far, the spam-filtering method based on the Naïve Bayesian
Classifier has proved to be both adaptable and effective. The use of several different
characteristics simultaneously, besides just words, to determine the probability that an e-mail
is spam, has significantly reduced the false-positive rates. However, because mistakenly
classifying legitimate e-mails as spam is so highly unacceptable, Graham and others are still
working to find more ways to eliminate this problem completely for spam filters of the
future. One possible way to address the continual problem of false positives is to find a way
for future spam filters to recognize non-spam features, an idea that Graham has explored but
for which he has not yet found a workable method.2 While spam filtering
today may not be perfect, modern filters based upon the Naïve Bayesian method have been
shuffled into the spam box. The very fact that most people utilize email and trust their filters
proves that these methods based off the Naïve Bayesian Classifier truly work. These filters
have saved email from turning into a denizen of the past.
References
1. Sahami, M., et al. 1998. A Bayesian Approach to Filtering Junk E-Mail. Microsoft Research.
[Online]. http://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf.
2. Graham, P. 2002. A Plan for Spam. [Online]. http://www.paulgraham.com/spam.html.
3. Symantec. 2009. State of Spam: A Monthly Report. [Online; cited 2011 Nov. 13].
http://sup.kathimerini.gr/xtra/media/files/var/spam_909.pdf.
4. Androutsopoulos, I., et al. 2000. An Evaluation of Naïve Bayesian Anti-Spam Filtering. arXiv.
[Online; cited 2011 Nov. 13]. 9-17. http://arxiv.org/pdf/cs/0006013v1.
5. Bayesian Spam Filtering. Wikipedia. [Online]. http://en.wikipedia.org/wiki/Bayesian_spam_filtering.
6. Graham, P. 2003. Better Bayesian Filtering. [Online]. http://www.paulgraham.com/better.html.