A Mathematical Approach for Filtering Junk E-Mail using Relevance Analysis

S.SathyaBama, Assistant Professor, Department of MCA, Sri Krishna College of Technology, Coimbatore, Tamil Nadu, INDIA. Mobile: 98655 33391, ssathya21@gmail.com
M.S.Irfan Ahmed, Director, Department of MCA, Nehru Institute of Engineering and Technology, Tamil Nadu, INDIA. Mobile: 90037 50009, msirfan@gmail.com
A.Saravanan, Assistant Professor, Department of MCA, Sri Krishna College of Technology, Coimbatore, Tamil Nadu, INDIA. Mobile: 98420 06163, a.saravanan21@gmail.com

Abstract

In this growing information technology world, almost all standard communication occurs through e-mail, and managing a mailbox has become a vast task in this e-world. Most inboxes are flooded with spam e-mails, especially when the user is linked with social networks. Such e-mails can lead to many kinds of attacks; they are harmful and often carry offensive content. Due to the low cost involved in sending e-mails, companies and individuals send bulk messages in the form of spam. It is therefore necessary to improve spam-control algorithms in various respects. This paper presents a mathematical approach to restrict spam e-mails through the subject and content relevancy of the e-mail; the results of this approach are used to classify an e-mail as spam.

Keywords: E-mail, E-mail spam, content relevancy

1. Introduction

Due to the extensive use of the internet, e-mail is undoubtedly a very effective means of communication, considered to be cheap and easy. Due to the low cost involved in sending e-mails, companies and several people send bulk messages in the form of spam [1]. Spam is also known as Unsolicited Commercial Email (UCE), Unsolicited Bulk Email (UBE), or junk mail [2]. It is the practice of sending unwanted e-mail messages, frequently with commercial content, in large quantities to an indiscriminate set of recipients.
However, the user is confronted with so many unwanted e-mails that important messages are missed, simply because mailbox space is often eaten up by these unwanted e-mails. Spam in e-mail started to become a problem when the Internet was opened up to the general public in the mid-1990s. It grew exponentially over the following years and today accounts for some 80 to 85% of all the e-mail in the world [3]. Pressure to make e-mail spam illegal has been successful in some jurisdictions, but less so in others [4]. Spammers take advantage of this fact and frequently outsource parts of their operations to countries where spamming will not get them into legal trouble. Generally, spammers collect e-mail addresses from chat rooms, websites, customer lists, newsgroups, and viruses which harvest users' address books; these addresses are then sold to other spammers. They also use a practice known as e-mail appending or e-pending, in which they use known information about their target (such as a postal address) to search for the target's e-mail address. Spam in the past contained known strings or patterns which are not necessary for the user. Unfortunately, the majority of e-mail clients now render Hypertext Markup Language (HTML) based e-mails, allowing spammers many opportunities to fool the filters. Content-based filters require never-ending tuning and adjustment in order to keep up with the spammers' latest tricks. Consequently, there are many organizations, as well as individuals, who have taken it upon themselves to fight spam with a variety of techniques. But because the Internet is public, there is really little that can be done to prevent spam, just as it is impossible to prevent junk mail. There are several algorithms available for detecting and filtering spam e-mails. Content-based classification analyzes the contents and packets of an e-mail using Bayesian networks [5] or pattern matching [6].
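To make the content-based (Bayesian) route concrete, a minimal naive Bayes scorer is sketched below. This is an illustrative sketch, not the filter of [5]: the function name, the Laplace smoothing constants, and the toy word counts are assumptions introduced here.

```python
import math

def spam_probability(words, spam_counts, ham_counts, n_spam, n_ham):
    """Return P(spam | words) under a uniform prior, combining
    per-word likelihoods with add-one (Laplace) smoothing."""
    log_spam = log_ham = 0.0
    for w in words:
        # Smoothed estimates of P(word | spam) and P(word | ham).
        p_w_spam = (spam_counts.get(w, 0) + 1) / (n_spam + 2)
        p_w_ham = (ham_counts.get(w, 0) + 1) / (n_ham + 2)
        log_spam += math.log(p_w_spam)
        log_ham += math.log(p_w_ham)
    # Convert the two log-likelihoods back to a posterior probability,
    # subtracting the max first for numerical stability.
    m = max(log_spam, log_ham)
    e_spam = math.exp(log_spam - m)
    e_ham = math.exp(log_ham - m)
    return e_spam / (e_spam + e_ham)
```

With toy counts from 100 spam and 100 ham messages (e.g. "offer" seen in 90 spam but only 2 ham messages), a message containing "offer" scores close to 1, while a message of typical ham words scores close to 0.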
Among the existing algorithms, Bayesian filtering [7, 8] produces the best results, yet it still does not detect all spam e-mails. Most of the existing algorithms consider content alone when filtering spam e-mails; to detect all spam e-mails, the existing filtering methods have to be enhanced. Another approach is DomainKeys Identified Mail (DKIM) [9], which associates a responsible identity with each e-mail, allowing the receiver to confirm the sender and origin of the e-mail. Unfortunately, this system does not prevent a bot from using the identities stored on a hijacked computer and sending e-mail through the domain's relays; it does, however, make it easier to identify the source of the e-mail. Adoption, as in many other cases, may prove to be the biggest hurdle for DKIM. Thus, this paper presents a weight-based relevance score for filtering spam messages. This weight-based approach classifies each e-mail and directs it to the inbox or the spam folder of the user. The rest of the paper is organized as follows: Section 2 presents the literature survey related to the proposed work; Section 3 describes the proposed architecture, and Section 3.1 presents the proposed algorithm for spam filtering; Section 4 explains the experimental results; and Section 5 provides the conclusion.

2. Related Work

Carreras et al. [10] proposed a boosting algorithm for anti-spam filtering in which the possibility of misclassification costs persists. Hidalgo et al. [11] present a new dimension for spam e-mail classification. Methods based on speech act theory and support vector machines have been developed for spam categorization [12, 13]. Even though the support vector approach performs well, switching the training model requires user intervention, and reply e-mails are treated as non-spam. Nikolaos et al. [14] implemented a new technique for spam categorization coupling header information with content information.
Even though the conceptualization is good, a practical bottleneck arises in identifying spam words from the global set, which takes more time. Peng et al. [15] proposed a spam filter for a distributed environment. Wanli et al. [16] proposed a new technique for identifying spam e-mails whose content is of image type. A Bayesian spanning tree with a likelihood function to locate an e-mail in the e-mail space has also been proposed [17]. Several methods have been proposed for web content outlier mining; the knowledge from these algorithms can be applied to spam detection, since spam can be considered the outlier among e-mails. Malik Agyemang et al. established the presence of outliers on the web, characterized the various types of outliers found there, and designed a framework with a hybrid approach for mining web content outliers using full word matching [18, 19, 20, 21]. G. Poonkuzhali et al. presented mathematical approaches based on set theory, a signed approach, and rectangular and correlation approaches for mining web content outliers [22, 23, 24, 25]. K. Thiagarajan et al. implemented a weighted graph approach to trust reputation management through the signed concept, which can also be applied to retrieving relevant content and to SMS and spam filtering [26]. Spam filtering using signed and trust reputation management is presented in [27]. Spam filtering using a fuzzy approach is proposed in [3]. Giuseppe Antonio Di Lucca et al. proposed an algorithm based on clone detection and similarity metrics to detect duplicate pages in web sites and applications implemented in HTML, which works only for structured web documents [28]. Min-yan Wang et al. suggested a web page de-duplication method in which information including the original websites and web titles is extracted to eliminate duplicated web pages based on feature codes, with the help of URL hashing [29]. Yunhe Weng et al.
came up with an improved scheme based on COPS, a copy detection system, which aims to protect the intellectual property of the document owner by detecting overlap among documents [30]. Zhongming Han et al. developed a novel multilayer framework for detecting duplicated web pages through two similar-text-paragraph detection algorithms based on edit distance and a bootstrap method [31]. Thus, many of the existing approaches are application-centric rather than user-centric. Also, classification of normal and spam e-mails is given more preference than management of the e-mails chosen by the user.

3. Proposed Architecture

Figure 1 shows the architectural design of the proposed spam filtering system. When an e-mail arrives at the proposed system, it passes through four phases of spam filtering: i) pre-processing phase, ii) weight assigning phase, iii) relevancy analysis phase, and iv) decision making phase.

Figure 1. Proposed Architectural Design for Spam Filtering

Whenever an e-mail is received, its subject and content are pre-processed. Before pre-processing, the user can move the mail to spam if it is identified as such. The pre-processing phase transforms the extracted content into a structured form, which improves the efficiency of the subsequent phases. In the weight assigning phase, weights are assigned separately for the mail id, the subject, and the content of the mail. Based on the assigned weights, the relevance score is calculated, and from the relevance score a final decision is made to categorize the e-mail as normal or spam.

Relevancy Score

The weight for the e-mail is assigned based on whether the sender's e-mail address is available in the receiver's address book. If the sender's address is in the receiver's address book, the value 1 is assigned; if it is in the spam list, the value -1 is assigned; otherwise 0 is assigned. Then the contents are pre-processed.
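The address-book check and pre-processing described above, together with the term weighting and relevance score detailed in section 3.1, can be sketched as follows. This is a minimal sketch under stated assumptions: the stop-word list, the crude suffix-stripping stand-in for a real stemmer, and all helper names are illustrative and not prescribed by the paper.

```python
import re

# Illustrative stop-word list; a real filter would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "for"}

def stem(word):
    # Crude suffix stripping, standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    # Step 2: tokenize, drop stop words, and stem the remaining terms.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def term_weight(terms, white_list, black_list):
    # Steps 3 and 4: +1 for a white-list hit, -1 for a black-list hit,
    # +0.5 for an unknown term; normalized by the number of terms.
    if not terms:
        return 0.0
    w = 0.0
    for t in terms:
        if t in white_list:
            w += 1.0
        elif t in black_list:
            w -= 1.0
        else:
            w += 0.5
    return w / len(terms)

def classify(sender, subject, content, address_book, spam_list,
             white_list, black_list):
    # Step 1: weight W1 for the sender address.
    if sender in address_book:
        w1 = 1.0
    elif sender in spam_list:
        w1 = -1.0
    else:
        w1 = 0.0
    w2 = term_weight(preprocess(subject), white_list, black_list)
    w3 = term_weight(preprocess(content), white_list, black_list)
    rs = (w1 + w2 + w3) / 3.0                      # Step 5
    return ("Inbox" if rs >= 0.5 else "Spam"), rs  # Step 6
```

For example, a mail from an address in the receiver's address book whose subject and content terms all hit the white list yields W1 = W2 = W3 = 1 and RS = 1, and is routed to the inbox; a mail from a spam-listed address with black-listed terms yields RS = -1 and is routed to spam.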
The weight is assigned for the subject and the content based on the white list to find the relevancy of the received mail. Finally, the mean of the weights is calculated as the relevance score; the algorithm is given in section 3.1. A final decision is then made that recommends whether the received e-mail is to be placed in the inbox or in spam.

3.1 Algorithm for Spam Filtering

Input: Received e-mail
Output: Decision based on relevance score

Step 1: Check the sender's mail address in the address book. If there is a match, assign W1 = 1; if not, check the address in the spam list: if found, assign W1 = -1; else assign W1 = 0.
Step 2: Pre-process the terms in the subject and content separately by removing stop words and performing stemming.
Step 3: Let STi be the list of terms in the subject, where 1 ≤ i ≤ n1 and n1 is the total number of words in the subject. For each STi: if a match is found in the white list, increment W2 by 1; else if a match is found in the black list, decrement W2 by 1; else increment W2 by 0.5. Then set W2 = W2 / n1.
Step 4: Let CTi be the list of terms in the content, where 1 ≤ i ≤ n2 and n2 is the total number of words in the content. For each CTi: if a match is found in the white list, increment W3 by 1; else if a match is found in the black list, decrement W3 by 1; else increment W3 by 0.5. Then set W3 = W3 / n2.
Step 5: Calculate the relevance score: RS = (W1 + W2 + W3) / 3.
Step 6: If RS ≥ 0.5, move the mail to the inbox and update the white list; else move it to spam and update the black list.

4. Experimental Results with Verification

The proposed algorithm is compared with the ID3 algorithm; the decisions made by the proposed method match the results produced by ID3. A sample result is shown in Table 1.
Email ID (W1)   Subject (W2)   Content (W3)   Relevance Score (RS)   Result
1               0.83           0.87           0.90                   Inbox
0               0.17           0.47           0.21                   Spam
1               0.87           0.60           0.83                   Inbox
0               0.40           0.53           0.31                   Spam
0               0.33           0.67           0.33                   Spam
1               0.33           0.67           0.67                   Inbox
1               0.83           0.59           0.81                   Inbox
0               -0.20          0.50           0.10                   Spam
0               0.83           0.67           0.50                   Inbox
1               0.40           0.73           0.71                   Inbox

Table 1. Result for sample dataset

From the above table it is clear that this weight-based approach identifies the spam mails.

5. Conclusion

This paper proposes a weight-based approach for spam detection based on the mail id and the terms in the subject and content. The proposed approach outperforms the existing approaches in terms of accuracy in detecting spam e-mails. The proposed approach works only for e-mails whose subject and body content are plain text; future work aims at detecting spam mails containing images.

References

[1] Spam and the Social-Technical Gap, IEEE Computer, Vol. 37, Oct 2004.
[2] Tom Fawcett, "In vivo spam filtering: A challenge problem for data mining", KDD Explorations, Vol. 5, No. 2, Dec 2003, pp. 140-148.
[3] P. Sudhakar, G. Poonkuzhali, K. Thiagarajan, K. Sarukesi, "Terminator for E-mail Spam - A Fuzzy Approach Revealed", International Journal of Computers, Issue 3, Volume 5, 2011.
[4] E-mail Metrics Report, http://www.maawg.org/email_metrics_report
[5] Graham, P., A plan for spam. Reprinted in Paul Graham, Hackers and Painters: Big Ideas from the Computer Age, O'Reilly, 2004.
[6] Showalter, T., RFC 3028 - Sieve: A Mail Filtering Language, http://tools.ietf.org/html/rfc3028, 2001.
[7] Sahami, M., Dumais, S., Heckerman, D. and Horvitz, E., A Bayesian approach to filtering junk e-mail. In Workshop on Learning for Text Categorization, AAAI, 1998.
[8] Graham, P., Better Bayesian Filtering. In Proceedings of the Spam Conference, http://spamconference.org/proceedings2003.html, 2003.
[9] Allman, E., Callas, J., Delany, M., Libbey, M., DomainKeys Identified Mail (DKIM) Signatures,
http://www.ietf.org/internetdrafts/draft-ietfdkim-base-10.txt, 2007.
[10] Carreras, X. and Marquez, L., "Boosting trees for anti-spam e-mail filtering", In Proc. of RANLP, 2001.
[11] Hidalgo, J. G., Spez, M. and Sanz, E., Combining text and heuristics for cost-sensitive spam filtering. In Proc. of CoNLL, 2000.
[12] Cohen, W. W., "Learning Rules that Classify E-Mail", Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, Stanford, California, 1996.
[13] Drucker, H., Wu, D. and Vapnik, V., Support vector machines for spam categorization, IEEE Transactions on Neural Networks, Vol. 10, No. 5, pp. 1048-1054, 1999.
[14] Nikolaos Korfiatis, Marios Poulos, Sozon Papavlassopoulos, Proceedings of the WSEAS International Conference on Applied Mathematics, Greece, Aug 19, 2004, pp. 488-429.
[15] Peng Liu, Guangliang Chen, Liang Ye, Weiming Zhong, Proceedings of the 5th WSEAS Int. Conf. on Simulation, Modeling and Optimization, Corfu, Greece, August 17-19, 2005, pp. 61-66.
[16] Wanli Ma, Dat Tran, Dharmendra Sharma, Sen Li, Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007, p. 533.
[17] Sadegh Kharazmi, Ali Farahmand Nejad, Proceedings of the 9th WSEAS Int. Conference on Data Networks, Communications, Computers, Trinidad and Tobago, November 5-7, 2007.
[18] Malik Agyemang, Ken Barker and Rada S. Alhajj, Framework for Mining Web Content Outliers. In: ACM Symposium on Applied Computing, Nicosia, Cyprus, 2004, pp. 590-594.
[19] Malik Agyemang, Ken Barker and Rada S. Alhajj, Mining Web Content Outliers using Structure Oriented Weighting Techniques and N-Grams. ACM Symposium on Applied Computing, Santa Fe, New Mexico, 2005, pp. 482-487.
[20] Malik Agyemang, Ken Barker and Rada S. Alhajj, WCOND-Mine: Algorithm for Detecting Web Content Outliers from Web Documents. IEEE Symposium on Computers and Communications, 2005.
[21] Malik Agyemang, Ken Barker and Rada S. Alhajj, Hybrid Approach to Web Content Outlier Mining without Query Vector. Springer, Berlin, 2005, Vol. 3589.
[22] G. Poonkuzhali, K. Thiagarajan and K. Sarukesi, Set Theoretical Approach for Mining Web Content through Outliers Detection, International Journal on Research and Industrial Applications, Vol. 2, 2009, pp. 131-138.
[23] G. Poonkuzhali, K. Thiagarajan, K. Sarukesi and G. V. Uma, Signed Approach for Mining Web Content Outliers, Proceedings of World Academy of Science, Engineering and Technology, Vol. 56, pp. 820-824.
[24] G. Poonkuzhali, R. Kishore Kumar, R. Kripa Keshav, P. Sudhakar and K. Sarukesi, Correlation Based Method to Detect and Remove Redundant Web Documents, Advanced Materials Research, Vols. 171-172, 2011, pp. 543-546.
[25] G. Poonkuzhali, K. Sarukesi and G. V. Uma, Detection and Removal of Redundant Web Documents through Rectangular and Signed Approach, International Journal of Engineering, Science and Technology, Vol. 2, No. 9, 2010, pp. 4126-4132.
[26] K. Thiagarajan, A. Raghunathan, Ponnamal Natarajan, G. Poonkuzhali and Prashant Ranjan, Weighted Graph Approach for Trust Reputation Management, International Conference on Intelligent Systems and Technologies, published in Proc. of World Academy of Science and Technology, Vol. 56, 2009, pp. 830-836.
[27] G. Poonkuzhali, K. Thiagarajan, P. Sudhakar, R. Kishore Kumar, K. Sarukesi, "Spam Filtering using Signed and Trust Reputation Management", Recent Researches in Applied Computer and Applied Computational Science.
[28] Giuseppe Antonio Di Lucca, Massimiliano Di Penta and Anna Rita Fasolino, An Approach to Identify Duplicated Web Pages. In: Proceedings of the Annual International Computer Software and Applications Conference, IEEE Computer Society Press, 2002.
[29] Min-yan Wang and Dong-Sheng Liu, The Research of Web Page De-duplication Based on Web Pages Re-shipment Statement. First International Workshop on Database Technology and Applications, 2009, pp. 271-274.
[30] Yunhe Weng, Lei Li and Yixin Zhong, Semantic Keywords-Based Duplicated Web Pages Removing, IEEE, 2008.
[31] Zhongming Han, Qian Mo and Jianzhi Liu, Effectively and Efficiently Detect Web Page Duplication, IEEE, 2009.