Spam Filtering

1. Introduction

Spam is unsolicited, unwanted messages, mostly sent with commercial intent. The practice of sending spam messages is known as spamming. The number of bulk mails received increases every year. Spamming is also becoming harder to detect as spammers become smarter. Traditional spam detection systems that rely on word-based detection are easily defeated as spammers find new ways of representing words. For example, if the word mortgage were written as ‘m-o-r-t-g-a-g-e’ or as ‘m o r t g a g e’, these systems would not detect it. So, every time a new mutant of a word is detected, it has to be added to the database. The next alternative would be to ignore the punctuation marks and match whole words. This would not last long, as the next change to the spam message would be to write the word as ‘m0rtgage’, a combination of numerals and letters.

As the rules are made stricter, the chance of a legitimate message being falsely classified as spam increases. This is highly undesirable, as people tend not to look at their spam folders any more. The most fundamental requirement of any spam filter is “to never flag a good message as spam”, even if this means failing to detect a few spam messages.

2. Approach

Before deciding on an approach to designing a spam filter, we will look at the most sophisticated spam filter available today: “Us”. Humans can identify spam at a glance. How do we do it? We evaluate a message against several criteria. Some of them, in order, are:

1. The sender
   a. Does the sender’s e-mail address look valid, or does it look random (e.g. xrd4wd@bambam.com)?
   b. Is this a known sender?
2. The subject
3. The length of the message
4. The words used in the message
   a. Commercial words such as mortgage, sell
   b. Overly punctuated words
   c. Words mixed with numerals
5. The number of images in the message

These are but a few of the characteristics we use.
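As a rough illustration, the cues listed above could be turned into machine-readable features along the following lines. This is only a sketch under stated assumptions: the function names (`normalize`, `extract_features`), the digit-to-letter table `LEET`, and the vowel-ratio heuristic for a "random-looking" sender are all hypothetical choices made for this example, not part of the proposal.

```python
import re

# Hypothetical digit-for-letter substitutions; a real table would be larger.
LEET = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"})

# Single characters separated by '-', '.' or ' ', e.g. "m-o-r-t-g-a-g-e".
SPACED = re.compile(r"\b(?:\w[-. ]){2,}\w\b")
# Words containing at least one letter and at least one digit, e.g. "m0rtgage".
MIXED = re.compile(r"\b(?=\w*[a-z])(?=\w*\d)\w+\b", re.IGNORECASE)


def normalize(text):
    """Undo two simple obfuscations: spaced-out letters and digit substitution."""
    text = SPACED.sub(lambda m: re.sub(r"[-. ]", "", m.group()), text)
    return MIXED.sub(lambda m: m.group().translate(LEET), text)


def extract_features(sender, subject, body):
    """Collect a few of the cues a human reader uses (assumed feature names)."""
    local = sender.split("@", 1)[0].lower()
    text = f"{subject} {body}"
    return {
        # Crude stand-in for "does the address look random": few vowels.
        "sender_vowel_ratio": sum(c in "aeiou" for c in local) / max(len(local), 1),
        "length": len(body),
        "obfuscated_words": len(SPACED.findall(text)),  # overly punctuated words
        "mixed_words": len(MIXED.findall(text)),        # words mixed with numerals
        "tokens": normalize(text).lower().split(),      # cleaned words for a classifier
    }


features = extract_features(
    "xrd4wd@bambam.com", "m0rtgage", "Low m-o-r-t-g-a-g-e rates"
)
print(features)
```

Hand-written rules like these are exactly what spammers learn to evade, so in this sketch they only supply inputs; the decision itself is left to a classifier that learns from examples.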
Except for the first feature, we learn all of these as we see more spam messages. For example, the first time we see a message with the subject “you have won a lottery”, we may not recognize it as spam, but the next time we see it, we immediately categorize it as spam. The same applies to features 3-5. Simply put, the most robust approach would be to design a filter that duplicates this act of classification. The classifier must be capable of learning these characteristics and classifying based on them. It must also be capable of learning on the go. The simplest such classifier is the Bayes classifier.

3. Bayes Classifier

The Bayes classifier is a simple but effective learning algorithm that can be used to classify incoming messages into several classes (ω1, ω2, …, ωn). In fact, it is capable of much more than that: Bayesian classifiers are used in document classification, voice recognition and even facial recognition. It is a simple probabilistic classifier (a mathematical mapping system) which requires:

1. The prior probability P(ωi) that a given event belongs to a specific class
2. The likelihood function P(x|ωi) of a given feature set x describing a class

Once these data are available, the classifier divides the sample space into disjoint regions (Ω1, Ω2, …, Ωn). When there are only two classes (in our case, spam and not-spam), the classifier also provides a decision function δ(x) such that

δ(x) = ω1 if x ∈ Ω1
δ(x) = ω2 if x ∈ Ω2

Initially, the classifier needs to be trained on labeled features so that it can build up the likelihood functions and the prior probabilities. Once the classifier is put to work, it adjusts the likelihood functions and the decision boundaries as it comes across new feature values.

4. Advantages of a statistical filter

A statistical filter provides the following advantages:

1. It is simple to implement and computationally inexpensive.
2. It is efficient to train and use.
   a. It easily accepts new data without needing to be retrained completely.
3. It depends on the number of times a word is repeated. Hence, even new words and mutants of words will be caught automatically.
4. It can easily be trained to suit individual e-mail patterns.

5. Proposal

I propose to use the Bayes classifier for detecting and classifying incoming e-mail as spam. The classifier will be trained with both spam and legitimate mail to generate the PDFs. The feature set will be drawn from the features discussed in Approach. I will start with one feature and then attempt to increase the number of features. The steps involved can be briefly summarized as follows:

Training Phase:
1. Identify the classes.
2. Calculate the a priori probabilities of the classes.
3. Calculate the PDFs of the class and feature set.
4. Maintain a dictionary of ‘high frequency’ words and their weights.
5. Arrive at the decision boundaries.

Testing Phase:
1. Read a message.
2. Collect the features.
   a. For individual words, use the top ‘N’ weighted words.
   b. If any new words are encountered, add them to the dictionary with a weight.
3. Evaluate the discriminant function.
4. Classify the message as spam or not spam.
5. Modify the weights for any new words appropriately.

Features

Initially, the following features will be used:
1. Sender’s e-mail address
2. Words in the subject and message
3. Length of the message

Data sets

My own CSE e-mail would be an excellent source of training and test data. On average it receives 2 junk mails per day (as classified by the Microsoft Outlook spam filter), about 5 study-related mails addressed to me and around 20 messages addressed to my work group. In addition, I plan to use the bulk mail from my Google account, which has around 300 spam mails per day and 20 personal mails.
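To make the training and testing phases of the Proposal concrete, here is a minimal sketch of how they might look when only word features are used. It assumes a multinomial naive Bayes model with add-one (Laplace) smoothing, which the proposal does not specify; the class name `NaiveBayesFilter`, the labels "spam" and "ham", and the `margin` parameter are hypothetical names chosen for this example. The priors are estimated from the number of labeled messages per class, such as the daily counts quoted above.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Two-class word-count classifier (assumed: multinomial model, add-one smoothing)."""

    def __init__(self):
        self.msg_counts = Counter()  # labeled messages seen per class
        self.word_counts = {"spam": Counter(), "ham": Counter()}

    def train(self, tokens, label):
        """Training phase; can also be called later to learn on the go."""
        self.msg_counts[label] += 1
        self.word_counts[label].update(tokens)

    def log_posterior(self, tokens, label):
        """log P(label) + sum of log P(word | label), up to a shared constant.

        Requires at least one training message for every class.
        """
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        # A priori probability estimated from message counts.
        score = math.log(self.msg_counts[label] / sum(self.msg_counts.values()))
        for token in tokens:
            # Add-one smoothing keeps unseen words from zeroing the product.
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total + len(vocab)))
        return score

    def classify(self, tokens, margin=0.0):
        """Flag as spam only if spam beats ham by more than `margin` (in log units)."""
        diff = self.log_posterior(tokens, "spam") - self.log_posterior(tokens, "ham")
        return "spam" if diff > margin else "ham"


nb = NaiveBayesFilter()
nb.train(["cheap", "mortgage", "offer"], "spam")
nb.train(["mortgage", "rates", "win"], "spam")
nb.train(["meeting", "notes", "project"], "ham")
nb.train(["project", "deadline", "tomorrow"], "ham")
print(nb.classify(["cheap", "mortgage"]))
print(nb.classify(["project", "meeting"]))
```

Raising `margin` biases the decision toward legitimate mail, which is one simple way to honor the requirement never to flag a good message as spam. If a user corrects a classification, calling `train` again with the corrected label updates the counts without retraining from scratch.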
Additionally, data sets are available from [7], [8], [9] and [10].

Thoughts in Progress

I will also need to consider the presence of images and other formatting information such as HTML tags, but I have not yet decided how to handle them. Other open questions:

1. What weights do I assign to words?
2. How do I manage very short words?
3. How many words do I consider to categorize a message?
4. What if a spammer includes a lot of legitimate words to fool us? How do we handle this?
5. How do I handle new words?
6. Is there any way I can make this filter adapt to the user? For example, could it filter out announcements that are legitimate but of no interest to me?
7. How do I measure the performance?
   a. The speed of operation
   b. The prediction ratio
8. What do I do if the user decides that a message was misclassified? How does this change the PDFs?

6. References

[1] Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). “A Bayesian approach to filtering junk Email”. Learning for Text Categorization: Papers from the 1998 Workshop. Madison, Wisconsin: AAAI Technical Report WS-98-05.
[2] Yang, Z., Nie, X., Xu, W., and Guo, J. (2006). “An Approach to Spam Detection by Naive Bayes Ensemble Based on Decision Induction”. In Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications (ISDA'06), Volume 02 (October 16-18, 2006). IEEE Computer Society, Washington, DC, 861-866.
[3] Vangelis Metsis, Ion Androutsopoulos, Georgios Paliouras. “Spam Filtering with Naive Bayes – Which Naive Bayes?”. CEAS 2006, Third Conference on Email and Anti-Spam, July 27-28, 2006, Mountain View, California, USA.
[4] Aris Kosmopoulos, Georgios Paliouras, Ion Androutsopoulos. “Adaptive Spam Filtering Using Only Naive Bayes Text Classifiers”. CEAS 2008, Fifth Conference on Email and Anti-Spam, August 21-22, 2008, Mountain View, California, USA.
[5] Paul Graham. “A Plan for Spam”. August 2002. http://paulgraham.com/spam.html
[6] Robinson, Gary. “Spam Detection”. September 2002. http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
[7] http://spamassassin.apache.org/publiccorpus/
[8] http://www.iit.demokritos.gr/skel/i-config/
[9] http://www.aueb.gr/users/ion/data/enron-spam/
[10] http://wortschatz.uni-leipzig.de/html/wliste.html