A Mathematical Approach for Filtering Junk E-Mail using Relevance Analysis
S.SathyaBama
Assistant Professor,
Department of MCA,
Sri Krishna College of Technology,
Coimbatore, Tamil Nadu, INDIA
Mobile: 98655 33391
ssathya21@gmail.com
M.S.Irfan Ahmed
Director, Department of MCA,
Nehru Institute of Engineering and
Technology,
Tamil Nadu, INDIA
Mobile: 90037 50009
msirfan@gmail.com
A.Saravanan
Assistant Professor,
Department of MCA,
Sri Krishna College of Technology,
Coimbatore, Tamil Nadu, INDIA
Mobile: 98420 06163
a.saravanan21@gmail.com
Abstract
In today's information technology world, almost all standard communication occurs through e-mail, and managing a mailbox has become a vast task. Most inboxes are flooded with spam e-mails, especially when the user is linked with social networks. Such messages may lead to many kinds of attacks; they are harmful and often carry offensive content. Due to the low cost involved in sending e-mails, companies and individuals send bulk messages in the form of spam. It is therefore necessary to improve spam-control algorithms in various respects. This paper presents a mathematical approach for restricting spam e-mails through the subject and content relevancy of the e-mail. The results of this approach are used to classify an e-mail as spam.
Keywords- E-mail, E-mail spam, content relevancy
1. Introduction
With the widespread use of the Internet, e-mail is undoubtedly a very effective means of communication, considered both cheap and easy. Due to the low cost involved in sending e-mails, companies and many individuals send bulk messages in the form of spam [1]. Spam is also known as unsolicited commercial e-mail (UCE), unsolicited bulk e-mail (UBE), or junk mail [2]. It is the practice of sending unwanted e-mail messages, frequently with commercial content, in large quantities to an indiscriminate set of recipients. Users are confronted with so many unwanted e-mails that they miss important messages because their mailbox space is often eaten up by the unwanted ones. Spam in e-mail started to become a problem when the Internet was opened up to the general public in the mid-1990s. It grew exponentially over the following years, and today constitutes some 80 to 85% of all the e-mail in the world [3]. Pressure to make e-mail spam illegal has been successful in some jurisdictions, but less so in others [4]. Spammers take advantage of this fact and frequently outsource parts of their operations to countries where spamming will not get them into legal trouble. Spammers generally collect e-mail addresses from chat rooms, websites, customer lists, newsgroups, and viruses that harvest users' address books; these addresses are then sold to other spammers. They also use a practice known as e-mail appending or e-pending, in which they use known information about their target (such as a postal address) to search for the target's e-mail address.
Spam in the past contained known strings or patterns that were not needed by the user. Unfortunately, the majority of e-mail clients now render Hypertext Markup Language (HTML) based e-mails, giving spammers many opportunities to fool the filters. Content-based filters require never-ending tuning and adjustment in order to keep up with the spammers' latest tricks. Consequently, many organizations, as well as individuals, have taken it upon themselves to fight spam with a variety of techniques. But because the Internet is public, there is little that can be done to prevent spam, just as it is impossible to prevent junk mail.
There are several algorithms available for detecting and filtering spam e-mails. Content-based classification analyzes the contents and packets of an e-mail using Bayesian networks [5] or pattern matching [6]. Among the existing algorithms, Bayesian filtering [7, 8] produces the best results, yet it still does not detect all spam e-mails. Most existing algorithms consider content alone when filtering spam e-mails; to catch all spam, existing filtering methods have to be enhanced. Another approach is DomainKeys Identified Mail (DKIM) [9], which associates a responsible identity with each e-mail, allowing the receiver to confirm the sender and origin of the message. Unfortunately, this system does not prevent a bot from using the identities stored on a hijacked computer and sending e-mail through the domain's relays. It does, however, make it easier to identify the source of the e-mail. Adoption, as in many other cases, may prove to be the biggest hurdle for DKIM.
Thus, this paper presents a weight-based relevance score for filtering spam messages. This weight-based approach classifies each e-mail and directs it to the inbox or to the spam folder of the user. The rest of the paper is organized as follows. Section 2 surveys the literature related to the proposed work. Section 3 describes the proposed architecture, and Section 3.1 presents the proposed algorithm for spam filtering. Section 4 explains the experimental results, and Section 5 provides the conclusion.
2. Related work
Carreras et al. [10] proposed a boosting algorithm for anti-spam filtering that accounts for misclassification costs. Hidalgo et al. [11] present a new dimension for spam e-mail classification. Methods based on speech act theory and support vector machines have been developed for spam categorization [12, 13]. Even though the support vector approach performs well, switching from the trained model requires user intervention, and reply e-mails are treated as non-spam. Nikolaos et al. [14] implemented a new technique for spam categorization that couples header information with content information. Even though the conceptualization is good, the practical bottleneck is the identification of spam words from the global set, which takes considerable time. Peng et al. [15] proposed a spam filter for a distributed environment. Wanli et al. [16] projected a new technique for identifying spam e-mail whose content consists of images. A Bayesian spanning tree with a likelihood function to identify an e-mail's place in the e-mail space is proposed in [17].
Several methods have been proposed for web content outlier mining, and the knowledge from these algorithms can be used in spam detection, with spam regarded as the outlier among e-mails. Malik Agyemang et al. establish the presence of various types of outliers on the web and designed a framework with a hybrid approach for mining web content outliers using full-word matching [18, 19, 20, 21]. G. Poonkuzhali et al. presented mathematical approaches based on set theory, the signed approach, and rectangular and correlation approaches for mining web content outliers [22, 23, 24, 25]. K. Thiagarajan et al. implemented a weighted-graph approach to trust reputation management through the signed concept, which can also be applied to retrieving relevant content and to SMS and spam filtering [26]. Spam filtering using the signed approach and trust reputation management is presented in [27]. Spam filtering using a fuzzy approach is proposed in [3]. Giuseppe Antonio Di Lucca et al. proposed an algorithm based on clone detection and similarity metrics to detect duplicate pages in web sites and applications implemented with HTML, which works only for structured web documents [28]. Min-yan Wang et al. suggested a web page de-duplication method in which information including the original websites and web titles is extracted to eliminate duplicated web pages based on feature codes with the help of URL hashing [29]. Yunhe Weng et al. came up with an improved COPS (Copy Detection) scheme that aims to protect the intellectual property of the document owner by detecting overlap among documents [30]. Zhongming Han et al. developed a novel multilayer framework for detecting duplicated web pages through two text-paragraph similarity detection algorithms based on edit distance and a bootstrap method [31]. Many of the existing approaches are therefore application-centric rather than user-centric, and the classification of normal and spam e-mails takes precedence over the management of e-mails chosen by the user.
3. Proposed Architecture
Figure 1 shows the architectural design of the proposed spam filtering system. When an e-mail arrives at the proposed system, it passes through four phases of spam filtering: i) the pre-processing phase, ii) the weight-assigning phase, iii) the relevancy-analysis phase, and iv) the decision-making phase.
Figure 1. Proposed Architectural Design for Spam Filtering
Whenever an e-mail is received, its subject and content are pre-processed. Before pre-processing, the user can move the mail to spam if it is already identified as such. The pre-processing phase transforms the extracted content into a structured form, which improves the efficiency of all later phases. In the weight-assigning phase, weights are assigned separately to the mail id, the subject, and the content of the mail. Based on the assigned weights, the relevance score is calculated, and from the relevance score a final decision is made that categorizes the e-mail as normal or spam.
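The pre-processing phase described above, stop-word removal followed by stemming, can be sketched as follows. The stop-word list and the suffix-stripping rules here are simplified stand-ins for whatever lists a deployed filter would use:

```python
# A small illustrative stop-word list; a real filter would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "for"}

def preprocess(text):
    """Tokenize, drop stop words, and apply a crude suffix-stripping stemmer."""
    tokens = [t.lower().strip(".,!?") for t in text.split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        # Simplified stemming: strip one common suffix if the stem stays long enough.
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("Claim the amazing offers waiting for you!"))
# → ['claim', 'amaz', 'offer', 'wait', 'you']
```

In practice a full stemmer such as Porter's algorithm would replace the crude suffix rules, but the structured token list produced here is the form the later weight-assigning phase consumes.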
Relevancy Score
The weight for the e-mail is assigned according to whether the sender's e-mail address appears in the receiver's address book. If the sender's address is in the receiver's address book, the value 1 is assigned; if it is in the spam list, the value -1 is assigned; otherwise 0 is assigned. The contents are then pre-processed, and weights are assigned to the subject and the content based on the white list to find the relevancy of the received mail. Finally, the mean of the weights is calculated as the relevance score. The algorithm is given in Section 3.1. A final decision then recommends whether the received e-mail should be placed in the inbox or in spam.
3.1 Algorithm for Spam Filtering
Input: Received E-Mail
Output: Decision based on Relevance Score
Step 1: Check the sender's mail address in the address book. If there is a match, assign W1 = 1; if not, check the address in the spam list: if found, assign W1 = -1; else assign W1 = 0.
Step 2: Pre-process the terms in the subject and content separately by removing stop words and performing stemming.
Step 3: Let STi, 1 ≤ i ≤ n1, be the list of terms in the subject, where n1 is the total number of words in the subject. For each STi:
If a match is found in the white list, increment W2 by 1;
else if a match is found in the black list, decrement W2 by 1;
else increment W2 by 0.5.
Finally, set W2 = W2 / n1.
Step 4: Let CTi, 1 ≤ i ≤ n2, be the list of terms in the content, where n2 is the total number of words in the content. For each CTi:
If a match is found in the white list, increment W3 by 1;
else if a match is found in the black list, decrement W3 by 1;
else increment W3 by 0.5.
Finally, set W3 = W3 / n2.
Step 5: Calculate the Relevance Score. RS = (W1+W2+W3)/3
Step 6: If RS ≥ 0.5, move the mail to the Inbox and update the white list; else move it to Spam and update the black list.
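Steps 1-6 can be transcribed directly into code. In the sketch below, the address book, spam list, white list, and black list are small placeholder sets for illustration; a real deployment would load the user's actual lists:

```python
# Placeholder lists for illustration only.
ADDRESS_BOOK = {"alice@example.com"}
SPAM_LIST = {"promo@junkmail.example"}
WHITE_LIST = {"meeting", "report", "project"}
BLACK_LIST = {"lottery", "winner"}

def relevance_score(sender, subject_terms, content_terms):
    # Step 1: sender weight W1 from the address book / spam list.
    if sender in ADDRESS_BOOK:
        w1 = 1
    elif sender in SPAM_LIST:
        w1 = -1
    else:
        w1 = 0

    def term_weight(terms):
        # Steps 3/4: +1 for white-listed, -1 for black-listed, +0.5 for
        # unknown terms, normalized by the number of terms.
        w = 0.0
        for t in terms:
            if t in WHITE_LIST:
                w += 1
            elif t in BLACK_LIST:
                w -= 1
            else:
                w += 0.5
        return w / len(terms)

    w2 = term_weight(subject_terms)   # subject terms, already pre-processed
    w3 = term_weight(content_terms)   # content terms, already pre-processed
    rs = (w1 + w2 + w3) / 3           # Step 5: relevance score
    return rs, ("Inbox" if rs >= 0.5 else "Spam")  # Step 6: decision

rs, decision = relevance_score("alice@example.com",
                               ["project", "report"], ["meeting", "today"])
print(round(rs, 2), decision)   # → 0.92 Inbox
```

The function assumes Step 2 has already run, i.e. it receives stemmed, stop-word-free term lists; the white-list and black-list updates of Step 6 are left out of the sketch.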
4. Experimental Results with Verification
The proposed algorithm is compared with the ID3 algorithm; the decisions made by the proposed method match the results produced by ID3. A sample result is shown in Table 1.
Email ID (W1)   Subject (W2)   Content (W3)   Relevance Score (RS)   Result
1               0.83           0.87           0.90                   Inbox
0               0.17           0.47           0.21                   Spam
1               0.87           0.60           0.83                   Inbox
0               0.40           0.53           0.31                   Spam
0               0.33           0.67           0.33                   Spam
1               0.33           0.67           0.67                   Inbox
1               0.83           0.59           0.81                   Inbox
0               -0.20          0.50           0.10                   Spam
0               0.83           0.67           0.50                   Inbox
1               0.40           0.73           0.71                   Inbox
Table 1. Results for the sample dataset
From the table it is clear that this weight-based approach identifies the spam mails.
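The relevance scores in Table 1 follow directly from Step 5, RS = (W1 + W2 + W3)/3, and the Step 6 threshold. The check below recomputes RS and the Inbox/Spam decision for a few rows of the table:

```python
# (W1, W2, W3, RS as listed, result as listed) for selected rows of Table 1.
rows = [
    (1, 0.83, 0.87, 0.90, "Inbox"),
    (0, 0.17, 0.47, 0.21, "Spam"),
    (0, -0.2, 0.50, 0.10, "Spam"),
    (0, 0.83, 0.67, 0.50, "Inbox"),
]

for w1, w2, w3, rs_listed, result_listed in rows:
    rs = round((w1 + w2 + w3) / 3, 2)            # Step 5 of the algorithm
    decision = "Inbox" if rs >= 0.5 else "Spam"  # Step 6 threshold
    assert (rs, decision) == (rs_listed, result_listed)

print("listed rows reproduce")
```

Note the fourth row: an RS of exactly 0.50 lands in the Inbox because the Step 6 test is RS ≥ 0.5, not a strict inequality.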
5. Conclusion
This paper proposes a weight-based approach for spam detection based on the mail id and on the terms in the subject and content of the e-mail. The proposed approach outperforms existing approaches in terms of accuracy in detecting spam e-mails. It works only for e-mails whose subject and body content are plain text; future work aims at detecting spam mails that contain images.
References
[1] Spam and the Social Technical Gap, IEEE Computer, Vol. 37, Oct 2004.
[2] Tom Fawcett, "In vivo spam filtering: A challenge problem for data mining", KDD Explorations, Vol. 5, No. 2, Dec 2003, pp. 140-148.
[3] P. Sudhakar, G. Poonkuzhali, K. Thiagarajan, K. Sarukesi, "Terminator for E-mail Spam - A Fuzzy Approach Revealed", International Journal of Computers, Issue 3, Volume 5, 2011.
[4] E-mail Metrics Report, http://www.maawg.org/email_metrics_report
[5] Graham, P., A Plan for Spam. Reprinted in Paul Graham, Hackers and Painters: Big Ideas from the Computer Age, O'Reilly, 2004.
[6] Showalter, T., RFC 3028 - Sieve: A Mail Filtering Language, http://tools.ietf.org/html/rfc3028, 2001.
[7] Sahami, M., Dumais, S., Heckerman, D. and Horvitz, E., A Bayesian Approach to Filtering Junk E-mail. In Workshop on Learning for Text Categorization, AAAI, 1998.
[8] Graham, P., Better Bayesian Filtering. In Proceedings of the Spam Conference, http://spamconference.org/proceedings2003.html, 2003.
[9] Allman, E., Callas, J., Delany, M., Libbey, M., DomainKeys Identified Mail (DKIM) Signatures, http://www.ietf.org/internetdrafts/draft-ietfdkim-base-10.txt, 2007.
[10] Carreras, X. and Màrquez, L., "Boosting Trees for Anti-Spam E-mail Filtering". In Proc. of RANLP, 2001.
[11] Hidalgo, J. G., López, M. and Sanz, E., Combining Text and Heuristics for Cost-Sensitive Spam Filtering. In Proc. of CoNLL, 2000.
[12] Cohen, W. W., "Learning Rules that Classify E-Mail". Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, Stanford, California, 1996.
[13] Drucker, H., Wu, D. and Vapnik, V., Support Vector Machines for Spam Categorization, IEEE Transactions on Neural Networks, Vol. 10, No. 5, pp. 1048-1054, 1999.
[14] Nikolaos Korfiatis, Marios Poulos, Sozon Papavlassopoulos, Proceedings of the WSEAS International Conference on Applied Mathematics, Greece, Aug 19, 2004 (488-429).
[15] Peng Liu, Guangliang Chen, Liang Ye, Weiming Zhong, Proceedings of the 5th WSEAS Int. Conf. on Simulation, Modeling and Optimization, Corfu, Greece, August 17-19, 2005, pp. 61-66.
[16] Wanli Ma, Dat Tran, Dharmendra Sharma, Sen Li, Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007, p. 533.
[17] Sadegh Kharazmi, Ali FarahmandNejad, Proceedings of the 9th WSEAS Int. Conference on Data Networks, Communications, Computers, Trinidad and Tobago, November 5-7, 2007.
[18] Malik Agyemang, Ken Barker and Reda S. Alhajj, Framework for Mining Web Content Outliers. In: ACM Symposium on Applied Computing, Nicosia, Cyprus, 2004, pp. 590-594.
[19] Malik Agyemang, Ken Barker and Reda S. Alhajj, Mining Web Content Outliers using Structure-Oriented Weighting Techniques and N-Grams. ACM Symposium on Applied Computing, Santa Fe, New Mexico, 2005, pp. 482-487.
[20] Malik Agyemang, Ken Barker and Reda S. Alhajj, WCOND-Mine: Algorithm for Detecting Web Content Outliers from Web Documents. IEEE Symposium on Computers and Communications, 2005.
[21] Malik Agyemang, Ken Barker and Reda S. Alhajj, Hybrid Approach to Web Content Outlier Mining without Query Vector. Springer, Berlin, 2005, Vol. 3589.
[22] G. Poonkuzhali, K. Thiagarajan and K. Sarukesi, Set Theoretical Approach for Mining Web Content through Outliers Detection, International Journal on Research and Industrial Applications, Vol. 2, 2009, pp. 131-138.
[23] G. Poonkuzhali, K. Thiagarajan, K. Sarukesi and G. V. Uma, Signed Approach for Mining Web Content Outliers. Proceedings of World Academy of Science, Engineering and Technology, Volume 56, pp. 820-824.
[24] G. Poonkuzhali, R. Kishore Kumar, R. Kripa Keshav, P. Sudhakar and K. Sarukesi, Correlation Based Method to Detect and Remove Redundant Web Document, Advanced Materials Research, Vols. 171-172, 2011, pp. 543-546.
[25] G. Poonkuzhali, K. Sarukesi and G. V. Uma, Detection and Removal of Redundant Web Document through Rectangular and Signed Approach, International Journal of Engineering, Science and Technology, Vol. 2 (9), 2010, pp. 4126-4132.
[26] K. Thiagarajan, A. Raghunathan, Ponnamal Natarajan, G. Poonkuzhali and Prashant Ranjan, Weighted Graph Approach for Trust Reputation Management, International Conference on Intelligent Systems and Technologies, published in Proc. of World Academy of Science and Technology, Vol. 56, 2009, pp. 830-836.
[27] G. Poonkuzhali, K. Thiagarajan, P. Sudhakar, R. Kishore Kumar, K. Sarukesi, "Spam Filtering using Signed and Trust Reputation Management", Recent Researches in Applied Computer and Applied Computational Science.
[28] Giuseppe Antonio Di Lucca, Massimiliano Di Penta and Anna Rita Fasolino, An Approach to Identify Duplicated Web Pages. In: Proceedings of the 28th Annual International Computer Software and Applications Conference, IEEE Computer Society Press, 2002.
[29] Min-yan Wang and Dong-Sheng Liu, The Research of Web Page De-duplication Based on Web Pages Re-shipment Statement. First International Workshop on Database Technology and Applications, 2009, pp. 271-274.
[30] Yunhe Weng, Lei Li and Yixin Zhong, Semantic Keywords-based Duplicated Web Pages Removing, IEEE, 2008.
[31] Zhongming Han, Qian Mo and Jianzhi Liu, Effectively and Efficiently Detect Web Page Duplication, IEEE, 2009.