International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org, editorijaiem@gmail.com
Volume 2, Issue 4, April 2013
ISSN 2319 - 4847
COMPARISON AND ANALYSIS OF SPAM
DETECTION ALGORITHMS
Sahil Puri1, Dishant Gosain2, Mehak Ahuja3, Ishita Kathuria4, Nishtha Jatana5
1,2,3,4 Student, Department of Computer Science and Engineering, Maharaja Surajmal Institute of Technology, New Delhi, India
5 Assistant Professor, Department of Computer Science and Engineering, Maharaja Surajmal Institute of Technology, New Delhi, India
ABSTRACT
This paper presents a comprehensive study of spam detection algorithms in the categories of content based filtering
and rule based filtering. The implementations have been benchmarked to analyze how accurately they classify mails
into their original categories of spam and ham. Further, a new filter is suggested in the proposed work, in which
rule based filtering is followed by content based filtering for more efficient results.
Key words: Spam, AdaBoost, KNN, Chi-Square, Black list, White list, Cache Architecture.
1. INTRODUCTION
Email today is a fast, inexpensive and convenient way of sharing personal and business information. But its
simplicity and ease of use have also made it a hub of scams, and inboxes are often full of undesirable mails. It has
therefore become essential to have reliable tools that separate spam from ham. A spam filter is applied to each
incoming email: if the mail is unsolicited, it is dropped into the junk folder; if it is ham (mail sent by a genuine
user, and hence desirable), it is dropped into the inbox. [15] The basic concept of a spam filter can be illustrated
as follows:
Figure 1: Block diagram of spam filter
Our research comprises two broad categories. The first is an analytical study of spam detection algorithms based
on content filtering: the Fisher-Robinson Inverse Chi-Square function, the AdaBoost algorithm and the KNN algorithm. In
the second category, we worked on rules that can be applied to the header of an email for fast
filtering without considering the content of the mail. The algorithms have been implemented and their results
studied to compare the effectiveness of the techniques and identify the most accurate one. Each
technique is described in the following sections along with its implemented result. The paper concludes with the
benchmarking of the techniques.
2. CONTENT BASED SPAM FILTERING
The basic format of Email generally consists of the following sections:
I. Header section, that includes
1. The sender email address,
2. The receiver email address,
3. The Subject of the email and
II. Content of the email that includes the main body of the email consisting of text, images and other multimedia
format data. [17]
In content based spam filtering [2], the focus is on classifying an email as spam or ham based on the data
present in the body, or content, of the mail. The header section is therefore ignored in content based spam filtering.
A number of techniques are available for content based filtering, such as Bayesian filtering, the Gary Robinson
technique, the AdaBoost classifier, the KNN classifier, and the combining function based on the Fisher-Robinson
Inverse Chi-Square Function. This paper specifically compares implementations of the Fisher-Robinson Inverse
Chi-Square Function, the AdaBoost classifier and the KNN classifier.
2.1 FISHER-ROBINSON INVERSE CHI-SQUARE FUNCTION
The Chi-Square method is a content based filtering technique proposed by Robinson. It uses a probability
function known as “Robinson’s Degree of Belief”. In this probability function, p(w) is Robinson’s
total probability function, s is a tunable constant, x is an assumed probability given to words never seen before
(hapaxes), and n is the number of messages containing the token. Initial values of 1 and 0.5 for s and x, respectively,
are recommended.
Robinson suggested using this function in situations where the token has been seen just a few times. An extreme
case occurs when a token has never been seen before. In such a case, the value of x is returned.
(Robinson’s Degree of Belief Function)
In this function, Robinson used the number of messages containing that token.
(Robinson’s Token Probability Function)
The Fisher-Robinson Inverse Chi-Square Function is a combining function proposed by Robinson and named after
the work of Sir Ronald Fisher on which it is based.
It consists of three parts. H is the combined probability sensitive to hammy values, S is the combined probability
sensitive to spammy values, and I produces the final probability in the usual 0 to 1 range. C⁻¹ is the inverse
chi-square function, and n is the number of tokens used in the decision matrix.
(Fisher-Robinson’s Inverse Chi-Square Function)
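For reference, the three functions named in the captions above are usually written as follows in Robinson's description of the method. The symbols f(w), b(w) and g(w) are names assumed here for the degree of belief and for the fractions of spam and ham messages containing the token w; they do not appear in the text above. Note that n is the number of messages containing the token in f(w), and the number of tokens in H and S.

```latex
\begin{align*}
p(w) &= \frac{b(w)}{b(w) + g(w)}, \qquad
b(w) = \frac{\text{spam messages containing } w}{\text{total spam messages}}, \quad
g(w) = \frac{\text{ham messages containing } w}{\text{total ham messages}} \\
f(w) &= \frac{s \cdot x + n \cdot p(w)}{s + n} \\
H &= C^{-1}\Bigl(-2 \ln \prod_{w} f(w),\; 2n\Bigr) \\
S &= C^{-1}\Bigl(-2 \ln \prod_{w} \bigl(1 - f(w)\bigr),\; 2n\Bigr) \\
I &= \frac{1 + H - S}{2}
\end{align*}
```

With n = 0 the second line reduces to f(w) = x, which is the never-seen-before case described above.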
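Jonathan Zdziarski gave the corresponding C code for C⁻¹, which is as follows:
#include <math.h>

/* Inverse chi-square: probability that a chi-square variable with v
   (even) degrees of freedom is at least x. */
double chi2Q(double x, int v)
{
    int i;
    double m, s, t;

    m = x / 2.0;
    s = exp(-m);                  /* first term of the series */
    t = s;
    for (i = 1; i < (v / 2); i++)
    {
        t *= m / i;               /* next term: exp(-m) * m^i / i! */
        s += t;
    }
    return (s < 1.0) ? s : 1.0;   /* clamp rounding overshoot */
}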
2.2 ADABOOST CLASSIFIER
In this technique, we investigate the performance of active learning with confidence based label sampling using
Boosting. A variant of the AdaBoost algorithm is used to train a classifier and obtain a scoring function which can
be used to classify a mail as spam or ham. [20]
The AdaBoost technique needs labeled data to train its classifier. Labeled data is data that has already
been classified as spam or ham. This data is used to train the initial classifier, which generates the functions
required for classifying spam messages. The Boosting algorithm is used to improve the training process. AdaBoost,
short for Adaptive Boosting, is one of the most widely used boosting techniques. AdaBoost calls a classifier
repeatedly in a series of rounds n = 1, ..., N. On each call, a distribution of weights D(n) is updated that
indicates the importance of each record in the data corpus for the classification. In each round, the weight of
every wrongly classified record is increased (in other words, the importance of every correctly classified record
is decreased), making the new classifier more sensitive to the incorrectly classified records. In the case of
AdaBoost, examples are initially labeled by the user to train the classifier manually. Furthermore, k additional
records are identified as hard records and used to train the classifier on the hard examples, improving the
efficiency of the classifier that will be used to classify the unlabeled data. [7]
The active learning technique used is:
• Given a data corpus C, divided into an unlabeled data corpus C (unlabeled) and a labeled data corpus C (labeled).
• Iterate:
  o Train the classifier using the labeled data corpus, C (labeled).
  o Using the classifier generated above, test the C (unlabeled) corpus and generate scores using a scoring
    function.
  o Associate each record with its corresponding score.
  o Label the records with the lowest scores (the k hard records that make the classifier efficient).
  o Include the newly labeled records in the C (labeled) corpus.
  o Remove the newly labeled records from the C (unlabeled) corpus.
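The selection step in the loop above ("label the records with the lowest scores") can be sketched as follows. This is only an illustration under the assumption that each unlabeled record already has a numeric score; the function name `select_hard_records` is hypothetical.

```c
#include <stddef.h>

/* Write into hard[0..k-1] the indices of the k records with the lowest
   scores. Assumes k <= n. A simple O(n*k) selection, adequate for a sketch. */
void select_hard_records(const double *score, size_t n, size_t k, size_t *hard)
{
    for (size_t picked = 0; picked < k; picked++) {
        size_t best = n;                      /* n means "none chosen yet" */
        for (size_t i = 0; i < n; i++) {
            int taken = 0;
            for (size_t j = 0; j < picked; j++)
                if (hard[j] == i) taken = 1;  /* skip records already picked */
            if (taken) continue;
            if (best == n || score[i] < score[best])
                best = i;
        }
        hard[picked] = best;
    }
}
```

The returned indices identify the records to hand to the user for labeling before they are moved from C (unlabeled) to C (labeled).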
The scoring criterion used to find the k hard records is Boosting, in which the choice is based on the weighted
majority vote. Training is carried out using the AdaBoost algorithm:
• Given the labeled training examples (x1, y1) … (xn, yn)
• Initialize the weights
• For t = 1 to T do
  o For each feature j, train a classifier
  o Compute the error of each classifier
  o Choose the classifier with the lowest error
  o Update the weights
• Compute the weight of each chosen classifier as the logarithm of a ratio determined by its error
• Final output: the strong classifier h(x)
AdaBoost algorithm
The scoring function used is:
confidence score, score( ) =
where,
2.3 KNN CLASSIFIER
K Means: The aim of K Means is to partition the objects in such a way that the intra cluster similarity is high but
the inter cluster similarity is comparatively low. A set of n objects is grouped into k clusters, where k is accepted
as an input parameter. All the data must be available in advance for the classification. [5]
KNN: Instead of assigning to a test pattern the class label of its single closest neighbor, the K Nearest Neighbor
classifier finds the k nearest neighbors on the basis of Euclidean distance,
$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$
where x and y are feature vectors of length m, and assigns the class held by the majority of those neighbors.
The value of k is crucial, because the right value of k leads to better classification. [6]
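A minimal sketch of the KNN step follows, assuming numeric feature vectors of a fixed length and a majority vote among the k nearest training points. `DIM`, the 64-record limit and the function names are assumptions made for illustration; the paper does not describe its feature representation.

```c
#include <stddef.h>

#define DIM 2   /* hypothetical feature-vector length */

/* Squared Euclidean distance; it orders neighbors the same way as the
   distance itself, so the square root can be skipped. */
static double sq_dist(const double *a, const double *b)
{
    double d = 0.0;
    for (size_t i = 0; i < DIM; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

/* Returns 1 (spam) or 0 (ham) for query q by majority vote among its k
   nearest training points. train holds n rows of DIM values (row-major);
   label[i] is 0 or 1. Assumes 1 <= k <= n and n <= 64. */
int knn_classify(const double *train, const int *label, size_t n,
                 const double *q, size_t k)
{
    int used[64] = {0};
    int spam_votes = 0;
    for (size_t picked = 0; picked < k; picked++) {
        size_t best = n;                      /* n means "none chosen yet" */
        for (size_t i = 0; i < n; i++) {
            if (used[i]) continue;
            if (best == n ||
                sq_dist(train + i * DIM, q) < sq_dist(train + best * DIM, q))
                best = i;
        }
        used[best] = 1;
        spam_votes += label[best];
    }
    return 2 * spam_votes > (int)k;
}
```

Choosing an odd k avoids tied votes between the two classes.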
3. RULE BASED FILTERING
Rule Based Filtering uses rules to classify mails as spam or ham. These rules may be applied to the “To:”,
“From:” or “Subject:” field of the header, or to the body of the mail. The rules range from checking the font size
of the text, to checking whether the mail arrived from an address in the person’s address book, to searching the
subject line for words like ‘free’, ‘sale’ and so on. [18]
Figure 2: Flowchart of rule based Filtering
3.1 CACHE ARCHITECTURE
Cache architecture consists of two lists, namely a black list and a white list.
Black List: In our research we put a few domains and email ids that were presumed to pose a danger or threat in
the black list. For example, websites that have a past record of fraud or that exploit browser vulnerabilities can
be put in the black list. When the filter is applied, if the sender of a mail has an entry in the black list, that
mail is undesirable and is considered spam.
White List: This is the opposite of the black list concept. It consists of the entries that are authorized and
allowed to pass through. Mails from these entries are considered ham and can be accepted by the user. It holds a
set of URLs and domain names that are legitimate.
After both lists are created, when an email arrives the ‘To’ and ‘From’ fields are extracted from its header and
checked against the black list and the white list. The main rule applied here is that if the sender is in the black
list, the mail is considered spam. The concept is illustrated in fig. 3.
Figure 3: Block diagram of cache architecture
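The black list and white list check can be sketched as a simple lookup. The list entries and the names `cache_lookup` and `Verdict` are hypothetical, chosen only for illustration; matching here is case-sensitive and exact, which a real filter would likely relax.

```c
#include <string.h>
#include <stddef.h>

typedef enum { VERDICT_SPAM, VERDICT_HAM, VERDICT_UNKNOWN } Verdict;

/* Hypothetical entries: either a full address or a bare domain. */
static const char *const BLACK_LIST[] = { "spammer@example.com", "bad.example" };
static const char *const WHITE_LIST[] = { "friend@example.org", "good.example" };

/* An entry matches the whole sender address or the domain after '@'. */
static int in_list(const char *sender, const char *const *list, size_t n)
{
    const char *at = strchr(sender, '@');
    const char *domain = at ? at + 1 : "";
    for (size_t i = 0; i < n; i++)
        if (strcmp(sender, list[i]) == 0 || strcmp(domain, list[i]) == 0)
            return 1;
    return 0;
}

/* The black list is checked first, following the main rule in the text;
   senders on neither list are left for later stages to decide. */
Verdict cache_lookup(const char *sender)
{
    if (in_list(sender, BLACK_LIST, sizeof BLACK_LIST / sizeof *BLACK_LIST))
        return VERDICT_SPAM;
    if (in_list(sender, WHITE_LIST, sizeof WHITE_LIST / sizeof *WHITE_LIST))
        return VERDICT_HAM;
    return VERDICT_UNKNOWN;
}
```

The third outcome, `VERDICT_UNKNOWN`, corresponds to mails that the lists alone cannot classify.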
3.2 RULES APPLIED ON HEADER
Some of the rules applied on the header are as follows: [21]
• Mailing List:
This rule is applied to the From and To fields of the email: if the sender or receiver address corresponds to an
address in the black list or white list, the corresponding action is taken. If the mail is multicast to a mailing
list whose number of addresses is above a certain threshold, such as 10,000, it is considered to be spam. [20]
e.g.: To: ankita@yahoo.com, kity@yahoo.com, kuhu@yahoo.com… (>10,000 entries)
• Pattern:
When the header (mailing list or subject) depicts some pattern, such as ankit*@gmail.com, the mail is categorized
as spam.
e.g.: ankita1@gmail.com, ankita2@gmail.com, ankita3@gmail.com…
Hurray!!!!!!!! You win a free Iphone!!!!!!!
• Content Based Filtering on Subject:
Here the normal content based filtering is applied to the subject line itself to find whether it contains words
classified as spam or ham.
This technique serves better than content based filtering applied to the entire body, as the length of the subject
is usually restricted to 1 or 2 lines (15-20 words), which improves the efficiency of the filter in classifying a mail.
e.g.: Subject: You have won cash price of Rs 10 lakhs!!!!!
3.3 RULES APPLIED ON THE BODY
Some of the rules applied on the body of the mails are:
1. Font Size: Spam mails generally use large fonts. The body is checked for words with a large font size; if their
frequency is higher than a preset threshold, the email can be declared spam.
2. Font Color: Spam mails usually use a large variation in text color to attract the receiver. Again, the
body can be scanned for words in different colors, and their frequency checked, to classify the mail as spam or
ham.
Some more rules that can be applied on the header and body are as follows:
1. Body consists entirely of images: if the body consists of only images, the message is generally spam.
2. From: contains an empty name: the mail can be classified as spam.
3. The ‘From’ field of the header starts with many numbers: if the From field consists of many numbers, the mail
is generally assumed to be spam sent from a machine-created email address.
4. From: has no local-part before the @ sign.
5. Message-ID contains multiple '@' characters.
6. The Subject contains gappy text.
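A few of the header rules listed in sections 3.2 and 3.3 are simple enough to sketch as string checks. The function names and the `min_digits` parameter are hypothetical; the paper gives no value for "many numbers", and real header parsing would have to handle display names and quoting, which this sketch ignores.

```c
#include <ctype.h>
#include <stddef.h>

/* "From: has no local-part before @ sign": the address starts with '@'. */
int from_missing_local_part(const char *addr)
{
    return addr[0] == '@';
}

/* "Message-ID contains multiple '@' characters". */
int message_id_multiple_at(const char *message_id)
{
    size_t count = 0;
    for (const char *p = message_id; *p; p++)
        if (*p == '@') count++;
    return count > 1;
}

/* "From field starts with many numbers": at least min_digits leading digits. */
int from_starts_with_digits(const char *addr, size_t min_digits)
{
    size_t n = 0;
    while (isdigit((unsigned char)addr[n])) n++;
    return n >= min_digits;
}

/* Mailing-list rule: count comma-separated recipients in a To: value and
   compare against a threshold (the text suggests 10,000). */
int too_many_recipients(const char *to_field, size_t threshold)
{
    size_t count = (*to_field != '\0');   /* a non-empty field has one entry */
    for (const char *p = to_field; *p; p++)
        if (*p == ',') count++;
    return count > threshold;
}
```

Each check returns 1 when its rule fires, so the results can be combined or weighted by whatever policy the filter uses.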
4. BENCHMARKING OF TECHNIQUES
The major techniques illustrated in the previous sections have been implemented and the results are shown in
Table 1. [16] The mails are categorized as:
• Spam mails that were incorrectly classified as ham mails.
• Spam mails that were correctly classified as spam mails.
• Ham mails that were correctly classified as ham mails.
• Ham mails that were incorrectly classified as spam mails.
Table 1: Classification of mails
Name of Technique      | AdaBoost Classifier | KNN Classifier | Chi Function | Cache Architecture (dataset of 431 emails)
Spam as ham            | 8                   | 5              | 2            | 2
Spam as spam           | 60                  | 54             | 57           | 48
Ham as ham             | 17                  | 42             | 48           | 230
Ham as spam            | 32                  | 7              | 1            | 7
Correctly Classified   | 78%                 | 89%            | 97.2%        | 64.5%
Incorrectly Classified | 22%                 | 11%            | 2.8%         | 2.1% (Rest 33.4% cannot be classified)
Execution Time         | 48740ms             | 8290ms         | 38656ms      | 3797ms
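The percentage rows of Table 1 follow from the four counts. As a small sketch, assuming the Chi Function counts are 2, 57, 48 and 1 (108 mails, which reproduces the reported 97.2%) and the Cache Architecture counts are 2, 48, 230 and 7 out of 431 mails (which reproduces 64.5% and 2.1%), the arithmetic is as follows. The function names are illustrative.

```c
/* Percentage of correctly classified mails out of `total`. total may exceed
   the sum of the four counts when some mails are left unclassified. */
double percent_correct(int spam_as_spam, int ham_as_ham, int total)
{
    return 100.0 * (spam_as_spam + ham_as_ham) / total;
}

/* Percentage of incorrectly classified mails out of `total`. */
double percent_incorrect(int spam_as_ham, int ham_as_spam, int total)
{
    return 100.0 * (spam_as_ham + ham_as_spam) / total;
}
```

For the cache architecture, the remainder after subtracting both percentages from 100 is the share of mails that neither list could classify.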
5. PROPOSED WORK
After studying the techniques that can be beneficial for the detection of undesirable mails, we interfaced
content based filtering with rule based filtering to produce more efficient results. [19] The suggested
procedure is:
Figure 4: Flowchart of the Proposed Work.
• First, implement rule based filtering, which identifies spam mails according to the cache
architecture and the other header based rules.
• Implement a content based filtering technique, i.e. the Fisher-Robinson Inverse Chi-Square Function. We chose
this technique because it shows the most accurate results compared with the other content based techniques
(refer to the benchmarking in section 4).
• Pass the mails left unclassified by the rule based filter through content based filtering, which finally
classifies the remaining mails.
This approach of interfacing both modules is followed to decrease the filtering time and increase the efficiency
of the filter we have developed.
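The two-stage arrangement can be sketched as a function that consults the content based filter only when the rule based filter leaves a mail unclassified. Everything named here is an assumption for illustration: `demo_rules` and `demo_content` are stand-ins for the real filters, and `spam_cutoff` is a hypothetical threshold, since the paper does not state one.

```c
#include <string.h>

typedef enum { VERDICT_SPAM, VERDICT_HAM, VERDICT_UNKNOWN } Verdict;

typedef Verdict (*RuleFilter)(const char *header);
typedef double (*ContentFilter)(const char *body);

/* Stage 1 runs first; stage 2 is called only for unclassified mails,
   which is where the saving in filtering time comes from. */
Verdict classify_mail(const char *header, const char *body,
                      RuleFilter rules, ContentFilter content,
                      double spam_cutoff)
{
    Verdict v = rules(header);
    if (v != VERDICT_UNKNOWN)
        return v;
    return content(body) >= spam_cutoff ? VERDICT_SPAM : VERDICT_HAM;
}

/* Stand-in rule filter: a toy black list / white list check. */
Verdict demo_rules(const char *header)
{
    if (strstr(header, "spammer@")) return VERDICT_SPAM;
    if (strstr(header, "friend@")) return VERDICT_HAM;
    return VERDICT_UNKNOWN;
}

/* Stand-in content filter: returns a fixed spam indicator in [0, 1]. */
double demo_content(const char *body)
{
    return strstr(body, "free") ? 0.95 : 0.10;
}
```

In a full implementation the content filter would return the Fisher-Robinson indicator computed from the tokens of the body.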
6. CONCLUSION
Various techniques of content based filtering were studied and analyzed. The implemented results are given in
Table 1. The most accurate technique according to our research was chosen and interfaced with the rule based
filtering technique to create a spam filter that is more efficient than any of the techniques described in the
previous sections and that reduces the filtering time.
REFERENCES
[1] Z. Gyöngyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. In 30th International
Conference on Very Large Data Bases, Aug. 2004.
[2] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. In Learning
for Text Categorization: Papers from the 1998 Workshop, AAAI Technical Report WS-98-05, 1998.
[3] A. Perkins. The Classification of Search Engine http://www.silverdisc.co.uk/articles/spam-classification/
[4] Z. Gyöngyi and H. Garcia-Molina. Link Spam Alliances. In 31st International Conference on Very Large Data
Bases, Aug. 2005.
[5] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani and Liadan O’Callaghan, “Clustering Data
Streams,” IEEE Trans. on Knowledge & Data Engg., 2003.
[6] Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques”, Second Edn.
[7] Lentczner, M. and Wong, M. “Sender Policy Framework: Authorizing Use of Domains in MAIL FROM”, Internet
Draft, http://www.ietf.org/internet-drafts/draft-lentcznerspf-00.txt, October, 2004.
[8] Boykin O. and Roychowdhury V., "Personal Email networks: an effective anti-spam tool", Condensed Matter cond-mat/0402143, pp. 1-10, 2004.
[9] Sahami M., Dumais S. et al., "A Bayesian Approach to Filtering Junk E-Mail", Learning for Text Categorization:
Papers from the 1998 Workshop, Madison, Wisconsin, pp. 1-8, 1998
[10] Golbeck J. and Hendler J., "Reputation Network Analysis for Email Filtering", CEAS, pp. 1-8, 2004.
[11] Sculley D., Wachman G. et al., "Spam Filtering Using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers", Text REtrieval Conference, pp. 1, 2006.
[12] Yoshida K., Adachi F. et al., "Density-based spam detector", Proceedings of the tenth ACM SIGKDD
international conference on Knowledge discovery and data mining, Seattle, WA, USA, pp. 486-493, 2004.
[13] Lam H.-Y. and Yeung, D.-Y., "A Learning Approach to Spam Detection based on Social Networks", Conference
on Email and Anti-Spam, CEAS 2007, pp. 1-9, 2007.
[14] Zhang L., Zhu J. et al., "An evaluation of statistical spam filtering techniques", vol. 3, no. 4, pp. 243-269, 2004.
[15] Hayati P. and Potdar V., "Evaluation of spam detection and prevention frameworks for email and image spam: a
state of art", Proceedings of the 10th International Conference on Information Integration and Web-based
Applications & Services, Linz, Austria, pp. 520-527, 2008.
[16] Heron S., "Technologies for spam detection",Network Security, 2009, 1, pp. 11-15, 2009.
[17] Khorsi A., "An Overview of Content-Based Spam Filtering Techniques", Informatica (Slovenia), pp. 269-277,
2007.
[18] Goodman, J. “IP Addresses in Email Clients”, Conference on Email and Anti-Spam 2004, July 2004.
[19] Segal, R. “Combining Multiple Classifiers”, Virus Bulletin, February 2005.
[20] Peng Wu, Hui Zhao, “Some Analysis and Research of the AdaBoost Algorithm”, Intelligent Computing and
Information Science, Communications in Computer and Information Science, Volume 134, 2011.
[21] http://spamassassin.apache.org/