Algorithms for Spam Classification
Bridget Kelly
Kyle McCombs
University of North Carolina at Wilmington
CSC 380
Abstract
Several text classification methods are used to identify spam emails. A well-known and widely used method is the naïve Bayes classifier. This paper explores that method and the Markov model in terms of success in correctly identifying spam emails. The success rates of these two methods, one of which assumes independence among features and one of which utilizes chaining, are compared to determine which is more effective.
1. Introduction:
A spam email is defined as an unsolicited, usually commercial, email that is sent to a large number of addresses. The prevalence of spam is not a figure that is universally agreed upon; studies vary in stating that spam emails account for anywhere from 70-95% of all email correspondence today. [3] Machine learning text classification techniques are an accurate way to identify spam emails, and in this paper two such methods are explored: the naïve Bayes classifier and the Markov model.
2. Methods:
In order to test our classification
methods, we used a spam archive found on
the website http://untroubled.org/spam/.
This archive consists of text files of spam emails collected from a single individual, the earliest dating from 1998, and it is regularly updated with the most recent spam emails. The emails from this archive were used for both our training set and our testing set for the spam classification methods explored in this paper. Our methods focused on achieving a high success rate in correctly identifying spam emails and on reducing false negative identification; we did not focus on identifying non-spam emails or on false positive identification.
For both methods of identifying spam, our process began the same way. We developed a program that reads in the spam text files and parses the body of each email word by word. We build a word bank that records every word occurring in the spam emails we evaluate, along with the frequency of that word's appearance across all of the emails. This information is used to associate each word with the probability that it appears in any given spam email. These individual probabilities are used in both the naïve Bayes classifier and the Markov model.
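A minimal sketch of this word-bank stage in Python appears below. The file layout, the tokenization, and the reading of the stored frequency as a per-email (document) frequency are our assumptions, as are names such as build_word_bank; the paper does not specify these details.

```python
import os
import re
from collections import Counter

def build_word_bank(spam_dir):
    """Count, for each word, how many spam emails it appears in."""
    doc_freq = Counter()
    n_emails = 0
    for name in os.listdir(spam_dir):
        with open(os.path.join(spam_dir, name), errors="ignore") as f:
            body = f.read()
        # Bag of words: lowercase tokens, counted once per email.
        doc_freq.update(set(re.findall(r"[a-z']+", body.lower())))
        n_emails += 1
    return doc_freq, n_emails

def word_spam_probability(doc_freq, n_emails, word):
    """Fraction of spam emails containing the word, clamped away from
    0 and 1 so that later log-domain sums stay finite."""
    count = min(max(doc_freq[word], 1), n_emails - 1)
    return count / n_emails
```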
2.1 Naïve Bayes Classifier:
The naïve Bayes classifier is one of the oldest and most commonly used methods for identifying spam emails, its first known usage being in 1996. The classifier utilizes the "bag of words" text representation model, whose important feature is that it disregards grammar, word order, and punctuation, but not word multiplicity. Essentially, under this model every structure that is evaluated, be it a sentence or, in our case, an email, is seen simply as a collection of words. [2]
Another important component in understanding the naïve Bayes classifier is that it operates under the assumption of independence among features in the model: the value of any feature is unrelated to the presence or absence of any other feature. In terms of this application, each word therefore contributes an individual probability to the email in question being spam. [1]
To implement the naïve Bayes classifier for spam identification in our application, the following formula is used:

p = (p1 · p2 · … · pn) / (p1 · p2 · … · pn + (1 − p1) · (1 − p2) · … · (1 − pn))
This formula considers each word's effect on the email in question being spam. Here p represents the probability that the email in question is spam, p1, p2, etc. represent the probabilities associated with each word in the email, and n represents the total number of words within the email being evaluated. The formula does not take into consideration the prior probability of any given email being spam regardless of contents (for example, considering that 85% of all emails are spam), because doing so can result in more frequent false positive identification. Instead of the multiplication shown in the formula, the classifier value is actually calculated using addition in the log domain within our program to eliminate floating-point underflow.
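A sketch of that log-domain computation, assuming each per-word probability has already been clamped strictly between 0 and 1 (as in the word-bank sketch above):

```python
import math

def spam_score(word_probs):
    """Combine per-word probabilities p1..pn into
    p = (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn)),
    using log-domain addition to avoid floating-point underflow."""
    # eta = sum(ln(1 - pi) - ln(pi)); then p = 1 / (1 + e^eta).
    eta = sum(math.log(1.0 - p) - math.log(p) for p in word_probs)
    try:
        return 1.0 / (1.0 + math.exp(eta))
    except OverflowError:
        # eta so large that e^eta overflows: p is effectively 0.
        return 0.0
```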
After calculating the classifier value from the formula above, the result is compared to a threshold to determine whether an email can be classified as spam. When implementing our naïve Bayes classifier, we evaluated 15,000 emails in our training phase. Our preliminary threshold value was the average classifier value from those 15,000 emails. This seemed like a reasonable option for the threshold because it is the mean of the naïve Bayes classifier value for 15,000 emails that are confirmed to be spam. When using this threshold, we had a 2.86% success rate (86 of 3,000 emails correctly identified) in identifying spam emails during our testing phase.
After running some statistical analysis on the dataset comprising the 15,000 naïve Bayes values from the learning set, we found that those values are not normally distributed. This is irrelevant to the effectiveness of our classifier as such, but the mean is often a poor summary of data that is not normally distributed, which led us to explore other options for the threshold to raise our success rate. When using the median of the 15,000 classifier values as the threshold for the 3,000 emails in our testing set, we had a success rate of 52.03%, or 1,561 out of 3,000 emails correctly identified.
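A minimal sketch of the threshold choice, assuming the classifier values sit in a Python list and that an email whose value meets the cutoff is declared spam; the paper does not state how ties are handled:

```python
import statistics

def choose_threshold(training_scores, use_median=True):
    """Derive the spam cutoff from the classifier values of the
    confirmed-spam training emails. Those values were not normally
    distributed, so the median proved a far better cutoff than the mean."""
    if use_median:
        return statistics.median(training_scores)
    return statistics.mean(training_scores)

def is_spam(score, threshold):
    # An email whose classifier value meets the cutoff is declared spam.
    return score >= threshold
```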
To improve the model further, only the fifteen words with the highest associated spam probabilities were included when determining the overall spam probability of a given email. This is done because no matter how many neutral words an email contains (words associated with a low spam probability), they should not dilute the statistical significance of highly incriminating spam words. When using this model, we had a success rate of 97.13%, or 2,914 out of 3,000 emails correctly identified as spam.
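A sketch of this variant, reusing the same log-domain combination as the earlier sketch but keeping only the fifteen highest per-word probabilities:

```python
import math

def spam_score_top_k(word_probs, k=15):
    """Score an email using only its k most spam-associated words, so
    that many neutral words cannot dilute a few incriminating ones."""
    strongest = sorted(word_probs, reverse=True)[:k]
    # Same log-domain combination as before: p = 1 / (1 + e^eta).
    eta = sum(math.log(1.0 - p) - math.log(p) for p in strongest)
    try:
        return 1.0 / (1.0 + math.exp(eta))
    except OverflowError:
        return 0.0
```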
The complexity of running the naïve Bayes classifier on any given email is O(n), where n is the number of words in the email being evaluated.
2.2 Markov Model:
The second approach we implemented to construct a spam-email classifying algorithm involved a Markov model. Markov models are used to model randomly changing systems in which it is assumed that future states depend only on the present state, not on the sequence of events that preceded it.
The process of implementation began with obtaining a training set of emails, all confidently classified as spam. The training set we used contained 3,000 spam emails, all stored as text files. Parsing through every pair of adjacent words in the entire training set, we calculated and stored the average occurrence rate at which word 'B' followed word 'A' across all spam emails in the training set, for all pairs of adjacent words unless one of the words contained a period, exclamation mark, or question mark. The presence of such punctuation implies that the two words being examined are not directly related, so the occurrence rate between them is irrelevant.
Once the average occurrence rate at which one word follows another is calculated for every word in the training set, we create a weighted directed graph to store these values. The vertices of the graph are the distinct words found in the training set. Each vertex has edges going to the vertices representing the corresponding adjacent words, and each edge carries a weight corresponding to the average occurrence rate between the two vertices.
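A sketch of this training step follows. The paper does not pin down the normalization behind "average occurrence rate"; the sketch reads it as the conditional frequency with which word B follows word A, which is one plausible interpretation, and the whitespace tokenization is likewise our assumption.

```python
from collections import defaultdict

def build_transition_graph(training_emails):
    """Weighted directed graph: edge (A, B) stores how often word B
    followed word A across the whole spam training set."""
    pair_counts = defaultdict(int)    # (A, B) -> times B followed A
    follow_counts = defaultdict(int)  # A -> times A was followed by any word
    for body in training_emails:
        words = body.lower().split()
        for a, b in zip(words, words[1:]):
            # Sentence-ending punctuation breaks the chain: the two
            # words are not directly related, so skip the pair.
            if any(ch in a or ch in b for ch in ".!?"):
                continue
            pair_counts[(a, b)] += 1
            follow_counts[a] += 1
    # Edge weight: rate at which B follows A, given A is followed at all.
    return {(a, b): c / follow_counts[a]
            for (a, b), c in pair_counts.items()}
```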
Once all of the occurrence rates are stored, we can classify an email. To classify an email in question, we parsed through it examining adjacent words, then looked up the average occurrence rate for each of these pairs. If an edge is not found, the two words never appeared beside each other in the training set, and the pair ultimately makes no contribution to the total grade of the email. Summing these average occurrence rates and then taking their average gives the final grade of the email in question.
To determine a threshold for declaring an email spam or not spam, we ran the above algorithm on the training set and took the average grade of all of the spam emails examined. To test our algorithm, we obtained another set of 1,000 spam emails. We then graded each email in the test set against the determined threshold: if the grade of the email in question was lower than the threshold, we declared it non-spam; if the grade was above the threshold, it was declared spam. The final results revealed that our algorithm had a 92% success rate at classifying spam emails: out of 1,000 spam emails in our test set, 920 were classified as spam.
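A sketch of the grading and classification step, under the assumption that the grade averages over only those pairs found in the graph (consistent with unseen pairs making no contribution) and that ties at the threshold count as spam:

```python
def grade_email(body, graph):
    """Average the stored occurrence rates over the email's adjacent
    word pairs; pairs never seen in training contribute nothing."""
    words = body.lower().split()
    rates = [graph[(a, b)]
             for a, b in zip(words, words[1:]) if (a, b) in graph]
    return sum(rates) / len(rates) if rates else 0.0

def classify(body, graph, threshold):
    # Grades at or above the training-set average are declared spam.
    return "spam" if grade_email(body, graph) >= threshold else "non-spam"
```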
3. Conclusion:
The naïve Bayes classifier and the Markov model differ in several ways, but the most important for our application is that naïve Bayes assumes independence among features while the Markov model utilizes chaining to evaluate a word's probability of association with another word. The Markov model is more difficult to implement than the comparatively simple naïve Bayes classifier and has a greater time complexity when running. After completing testing for both of these classification models, the naïve Bayes classifier had a higher success rate than the Markov model. The naïve Bayes model is still heavily used today to identify spam emails because of its simplicity and high success rate.
4. References:
[1] Jiang, Liangxiao, Harry Zhang, and Zhihua Cai. "A Novel Bayes Model: Hidden Naive Bayes." IEEE Transactions on Knowledge and Data Engineering 21.10 (2009). IEEE Xplore Digital Library. Web. 1 Nov. 2014. <http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4721435&abstractAccess=no&userType=inst>.
[2] Almeida, Tiago A., Akebo Yamakami, and Jurandy Almeida. "Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters." 2009 International Conference on Machine Learning and Applications (2009). IEEE Xplore Digital Library. Web. 1 Nov. 2014. <http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5381433&abstractAccess=no&userType=inst>.
[3] Graham, Paul. "Better Bayesian Filtering." Web. <http://www.paulgraham.com/better.html>.