Spam filtering

1. Introduction
Spam is unsolicited, unwanted messaging, mostly sent with commercial intent. The practice of sending such
messages is known as spamming. The volume of bulk mail received increases every year, and spam is becoming
harder to detect as spammers become smarter. Traditional spam detection systems that rely on word-based
matching are easily defeated as spammers find new ways of representing words. For example, if the word
mortgage is written as ‘m-o-r-t-g-a-g-e’ or as ‘m o r t g a g e’, these systems will not detect it, so every time a
new mutant of a word appears it has to be added to the database. The next alternative is to ignore punctuation
marks and match on whole words. This does not last long either, because the next change to the spam message is
to write the word as ‘m0rtgage’, a combination of numerals and letters. As the rules are made stricter, the chance
of a legitimate message being falsely classified as spam increases. This is highly undesirable, as people rarely
look at their spam folders any more. The most fundamental requirement of any spam filter is “to never flag a
good message as spam”, even if this means failing to detect a few spam messages.
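The weakness of word-based matching can be illustrated with a minimal sketch. The one-word blacklist and the helper names `naive_keyword_match` and `strip_punctuation_match` are hypothetical and exist only for this illustration; they are not part of any real filter.

```python
import re

# Hypothetical blacklist used only for this illustration.
BLACKLIST = {"mortgage"}


def naive_keyword_match(message):
    """Flag a message if any whitespace-separated token is blacklisted."""
    tokens = message.lower().split()
    return any(token in BLACKLIST for token in tokens)


def strip_punctuation_match(message):
    """Second attempt: keep only letters and digits, then search for the word."""
    squeezed = re.sub(r"[^a-z0-9]", "", message.lower())
    return any(word in squeezed for word in BLACKLIST)


samples = [
    "cheap mortgage rates",
    "cheap m-o-r-t-g-a-g-e rates",
    "cheap m o r t g a g e rates",
    "cheap m0rtgage rates",
]
for text in samples:
    print(text, naive_keyword_match(text), strip_punctuation_match(text))
```

The exact match catches only the first sample. Stripping punctuation and spaces catches the next two but still misses ‘m0rtgage’, and because it joins neighbouring words together it can also match text that never contained the word, which is the false-positive risk described above.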
2. Approach
Before we decide on an approach to design a spam filter, we will take a look at the most sophisticated spam
filter available today – “Us”. Humans are capable of identifying spam at a glance. How do we do it? There are
several criteria we use to evaluate a message. Some of them are:
1. The sender
   a. Does the sender’s email look valid or does it look random (Ex: xrd4wd@bambam.com)
   b. Is this a known sender?
2. The subject
3. The length of the message
4. The words used in the message
   a. Commercial words such as mortgage, sell
   b. Overly punctuated words
   c. Words mixed with numerals
5. The number of images in the message
We consider them roughly in that order. These are only a few of the characteristics we use. Except for the first
feature, all the others are learnt as we see more spam messages. For example, the first time we see a message with
the subject “you have won a lottery”, we may not recognize it as spam, but the next time we see it we immediately
categorize it as spam. The same applies to features 3 to 5.
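As a rough illustration of how these cues could be turned into machine-readable features, here is a minimal sketch. The function name `extract_features`, the dictionary keys, the `known_senders` parameter and all the regular-expression heuristics are assumptions made for this example; in particular, counting `<img` tags presumes an HTML body, and treating a letter-and-digit local part as "random" is a deliberately crude rule.

```python
import re


def extract_features(sender, subject, body, known_senders=frozenset()):
    """Collect a few of the cues listed above into a plain dictionary."""
    local_part = sender.split("@", 1)[0]
    words = body.split()
    # Crude heuristic (assumption): a local part mixing letters and digits
    # is treated as random-looking, e.g. 'xrd4wd'.
    looks_random = bool(
        re.search(r"[A-Za-z]", local_part) and re.search(r"\d", local_part)
    )
    # Overly punctuated words such as 'm-o-r-t-g-a-g-e':
    # single letters separated by punctuation characters.
    punctuated = [
        w for w in words
        if re.fullmatch(r"(?:[A-Za-z][^\w\s]){2,}[A-Za-z]", w)
    ]
    # Words mixed with numerals such as 'm0rtgage':
    # digits with a letter on both sides.
    mixed = [w for w in words if re.search(r"[A-Za-z]\d+[A-Za-z]", w)]
    return {
        "sender_known": sender.lower() in known_senders,
        "sender_looks_random": looks_random,
        "subject": subject,
        "length": len(body),
        "punctuated_words": len(punctuated),
        "mixed_words": len(mixed),
        "image_count": len(re.findall(r"<img\b", body, re.IGNORECASE)),
    }


features = extract_features(
    "xrd4wd@bambam.com",
    "you have won a lottery",
    "cheap m-o-r-t-g-a-g-e and m0rtgage <img src='a.png'>",
)
print(features)
```

These hand-written rules are only a starting point; the idea in this proposal is that the classifier learns which of these values indicate spam rather than having thresholds fixed in advance.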
Simply put, the most robust approach would be to design a filter which would duplicate this classifying act. The
classifier must be capable of learning these characteristics and classify based on them. It must also be capable of
learning on the go. The simplest such classifier would be the Bayes classifier.
3. Bayes Classifier
The Bayes classifier is a simple but effective learning algorithm which can be used to classify the incoming
messages into several classes (ω1,ω2…ωn). In fact, it is capable of much more than just that. The Bayesian
classifier is used in document classification, voice recognition and even in facial recognition. It is a simple
probabilistic classifier (mathematical mapping system) which requires
1. The prior probability that a given event belongs to a specific class
2. The likelihood function of a given feature set describing a class P(x|ω1)
Once these data are available, the classifier divides the sample space into disjoint regions (Ω1, Ω2 … Ωn). When
there are only two classes (in our case: spam and not-spam), the classifier also provides a decision function δ(x) such
that
δ(x) = ω1 if x ∈ Ω1
δ(x) = ω2 if x ∈ Ω2
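For reference, the two regions can be written in terms of the priors and likelihoods listed above. What follows is the standard minimum-error Bayes rule for two classes, stated under the assumption that both kinds of misclassification cost the same. Since this proposal wants to avoid flagging good mail above all, a practical filter would likely shift the boundary so that a message needs stronger evidence before it is assigned to the spam class.

```latex
P(\omega_i \mid x) =
  \frac{P(x \mid \omega_i)\, P(\omega_i)}
       {P(x \mid \omega_1)\, P(\omega_1) + P(x \mid \omega_2)\, P(\omega_2)},
  \qquad i = 1, 2

\Omega_1 = \{\, x : P(x \mid \omega_1)\, P(\omega_1) > P(x \mid \omega_2)\, P(\omega_2) \,\}

\Omega_2 = \{\, x : P(x \mid \omega_1)\, P(\omega_1) \le P(x \mid \omega_2)\, P(\omega_2) \,\}
```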
Initially, the classifier needs to be trained on labeled features so that it can build up the likelihood functions and
the prior probabilities. Once the classifier is put to work and comes across new values for the features, it
adjusts the likelihood functions and the decision boundaries accordingly.
4. Advantages of a statistical filter
A statistical filter provides the following advantages:
1. It is simple to implement and is computationally inexpensive.
2. It is efficient to train and use.
   a. It easily accepts new data without a need to retrain completely.
3. It depends on how often a word occurs. Hence, even new words and mutants of words will be caught
   automatically once they recur.
4. It can be easily trained to suit individual e-mail patterns.
5. Proposal
I propose to use the Bayes classifier to detect and classify incoming e-mail as spam or legitimate mail. The
classifier will be trained with both spam and legitimate mail to generate the probability density functions (PDFs).
The feature set will be drawn from the features discussed in Approach. We will start with one feature and then
attempt to increase the number of features. The steps involved can be briefly summarized as follows:
Training Phase:
1. Identify the classes.
2. Calculate the prior probabilities of the classes.
3. Calculate the PDFs of the class and feature set.
4. Maintain a dictionary of ‘high frequency’ words and their weights.
5. Arrive at the decision boundaries.
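The training steps above could look roughly like the following sketch for the word feature alone. Everything specific here is an assumption made for illustration: the function name `train`, the class labels "spam" and "legit", whitespace tokenization, counting a word once per message, add-one (Laplace) smoothing, and using the log-likelihood ratio as the word weight. The proposal leaves the actual choice of weights open.

```python
import math
from collections import Counter


def train(labeled_messages):
    """Estimate class priors, smoothed word likelihoods and word weights.

    labeled_messages: iterable of (text, label) pairs,
    where label is "spam" or "legit" (hypothetical label names).
    """
    doc_counts = Counter()  # number of messages seen per class
    word_counts = {"spam": Counter(), "legit": Counter()}
    for text, label in labeled_messages:
        doc_counts[label] += 1
        # Count each distinct word once per message (presence, not frequency).
        word_counts[label].update(set(text.lower().split()))

    total = sum(doc_counts.values())
    priors = {c: doc_counts[c] / total for c in word_counts}
    vocabulary = set(word_counts["spam"]) | set(word_counts["legit"])
    # Add-one smoothing keeps a word unseen in one class from getting
    # probability zero there.
    likelihoods = {
        c: {w: (word_counts[c][w] + 1) / (doc_counts[c] + 2) for w in vocabulary}
        for c in word_counts
    }
    # Word weight: log-likelihood ratio; positive values point towards spam.
    weights = {
        w: math.log(likelihoods["spam"][w] / likelihoods["legit"][w])
        for w in vocabulary
    }
    return priors, likelihoods, weights


training_data = [
    ("cheap mortgage now", "spam"),
    ("mortgage offer", "spam"),
    ("meeting at noon", "legit"),
    ("project meeting notes", "legit"),
]
priors, likelihoods, weights = train(training_data)
print(priors)
print(sorted(weights.items(), key=lambda item: item[1], reverse=True)[:3])
```

With equal misclassification costs, the decision boundary in this formulation is the point where the prior log-odds plus the summed word weights equals zero. Because only counts are stored, new labeled messages can be folded in by updating the counters instead of retraining from scratch.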
Testing Phase:
1. Read a message.
2. Collect the features.
   a. For individual words, use the top ‘N’ weighted words.
   b. If any new words are encountered, add them to our dictionary with a weight.
3. Evaluate the discriminant function.
4. Classify the message as spam or not spam.
5. Modify the weights for any new words appropriately.
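A minimal sketch of steps 1 to 4 of the testing phase is given below, assuming that each word weight is a log-likelihood ratio log(P(w|spam) / P(w|legitimate)). The function name `classify`, the labels, the default of 15 words, the initial weight of 0.0 for new words and the 0.9 threshold are all illustrative assumptions, not values fixed by this proposal. Step 5, updating the weights of new words once the true class is known, is left out.

```python
import math


def classify(text, weights, prior_spam=0.5, top_n=15, threshold=0.9):
    """Return (label, spam_probability) for one message.

    weights: dict mapping word -> log(P(word|spam) / P(word|legit)).
    New words are added to the dictionary with a neutral weight of 0.0.
    """
    words = set(text.lower().split())
    # Step 2b: unseen words enter the dictionary with an initial weight.
    scored = [(w, weights.setdefault(w, 0.0)) for w in words]
    # Step 2a: keep the N words with the largest absolute weight.
    scored.sort(key=lambda pair: abs(pair[1]), reverse=True)
    top = scored[:top_n]
    # Step 3: discriminant = prior log-odds + sum of the selected weights.
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    log_odds += sum(weight for _, weight in top)
    p_spam = 1.0 / (1.0 + math.exp(-log_odds))
    # Step 4: a high threshold trades missed spam for fewer false positives.
    label = "spam" if p_spam > threshold else "legit"
    return label, p_spam


example_weights = {
    "mortgage": math.log(9),
    "cheap": math.log(4),
    "meeting": -math.log(9),
}
print(classify("cheap mortgage today", example_weights))
print(classify("meeting about mortgage", example_weights))
```

Setting the threshold well above 0.5 reflects the requirement from the Introduction that a good message should never be flagged as spam, even at the cost of letting some spam through.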
Features
Initially, the following features will be used
1. Sender’s e-mail address
2. Words in the subject and message
3. Length of the message
Data sets
My own CSE e-mail account would be an excellent source of training and test data. It receives an average of 2 junk
mails per day (as classified by the Microsoft Outlook spam filter), about 5 study-related mails addressed to me and
around 20 messages addressed to my work group. In addition, I plan on using the bulk mail from my Google
account, which receives around 300 spam mails and 20 personal mails per day. Further data sets are available
from [7], [8], [9] and [10].
Thoughts in Progress
I will also need to consider the presence of images and other formatting information such as HTML tags, but
have not yet decided how to handle them. Other open questions are:
1. What are the weights I assign to words?
2. How do I manage very short words?
3. How many words do I consider to categorize a message?
4. What if a spammer includes a lot of legitimate words to fool us? How do we handle this?
5. How do I handle new words?
6. Is there any way I can make this filter adapt to the user? For example, can it filter out announcements
   that are legitimate but not of interest to me?
7. How do I measure the performance?
   a. The speed of operation
   b. The prediction ratio
8. What do I do if the user decides that a message has been misclassified? How does it change the PDFs?
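On the question of measuring performance, the proposal does not define the prediction ratio, so the sketch below is only one possible reading. It reports overall accuracy together with the false positive rate (legitimate mail flagged as spam) and spam recall (spam actually caught). The function name `evaluate` and the labels "spam" and "legit" are assumptions made for this example.

```python
def evaluate(true_labels, predicted_labels):
    """Compute simple quality measures for a two-class spam filter."""
    pairs = list(zip(true_labels, predicted_labels))
    if not pairs:
        raise ValueError("no labeled messages to evaluate")
    tp = sum(1 for t, p in pairs if t == "spam" and p == "spam")
    fn = sum(1 for t, p in pairs if t == "spam" and p == "legit")
    fp = sum(1 for t, p in pairs if t == "legit" and p == "spam")
    tn = sum(1 for t, p in pairs if t == "legit" and p == "legit")
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Share of legitimate mail wrongly flagged: the costly error here.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Share of spam that was caught.
        "spam_recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


truth = ["spam", "spam", "legit", "legit", "legit"]
predicted = ["spam", "legit", "legit", "spam", "legit"]
print(evaluate(truth, predicted))
```

Reporting the false positive rate separately matters because a filter can score well on accuracy while still sending good mail to the spam folder. The speed of operation could be measured separately by timing classification over a fixed batch of messages.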
6. References
[1] Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. “A Bayesian Approach to Filtering Junk E-Mail”.
Learning for Text Categorization: Papers from the 1998 Workshop. Madison, Wisconsin: AAAI Technical
Report WS-98-05, 1998.
[2] Yang, Z., Nie, X., Xu, W., and Guo, J. “An Approach to Spam Detection by Naive Bayes Ensemble
Based on Decision Induction”. In Proceedings of the Sixth International Conference on Intelligent Systems Design
and Applications (ISDA'06), Volume 02, October 16-18, 2006. IEEE Computer Society, Washington, DC,
861-866.
[3] Metsis, V., Androutsopoulos, I., and Paliouras, G. “Spam Filtering with Naive Bayes – Which
Naive Bayes?”. CEAS 2006, Third Conference on Email and Anti-Spam, July 27-28, 2006, Mountain View,
California, USA.
[4] Kosmopoulos, A., Paliouras, G., and Androutsopoulos, I. “Adaptive Spam Filtering Using Only Naive
Bayes Text Classifiers”. CEAS 2008, Fifth Conference on Email and Anti-Spam, August 21-22, 2008, Mountain
View, California, USA.
[5] Graham, P. “A Plan for Spam”. August 2002. http://paulgraham.com/spam.html
[6] Robinson, G. “Spam Detection”. September 2002.
http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
[7] http://spamassassin.apache.org/publiccorpus/
[8] http://www.iit.demokritos.gr/skel/i-config/
[9] http://www.aueb.gr/users/ion/data/enron-spam/
[10] http://wortschatz.uni-leipzig.de/html/wliste.html