International Journal of Engineering Trends and Technology- Volume3Issue5- 2012

Adaptive Classifier and Associative Algorithms for phishing detection

Ch.sonika #1, Mrs.D.Raaga Vamsi #2
1 M.Tech (CSE), Gudlavalleru Engineering College, Gudlavalleru
2 Professor, Gudlavalleru Engineering College, Gudlavalleru.

ABSTRACT: Phishing is a social engineering crime in which an attacker impersonates a trusted third party to gain access to private data. Data Mining (DM) techniques can be a very useful methodology for identifying and detecting phishing websites reported to PhishTank. In the proposed system, we present a novel approach to overcome the challenge and complexity of detecting and predicting offline phishing data. We propose an intelligent and effective model based on improved classification algorithms, namely improved C4.5, PRISM and PART, and on the association mining algorithm MCAR. This strategy applies different classification algorithms and techniques to the phishing training dataset to determine the legitimacy of its entries. We also compare their performance, accuracy and number of rules generated. The rules generated directly from the associative classification model showed relationships between some important characteristics of the PhishTank data. The experimental results show better performance compared to other traditional classification algorithms.

1 INTRODUCTION

Phishing is an online form of pretexting, a kind of deception in which an attacker pretends to be someone else in order to obtain sensitive information from the victim. Phishing is a significant practical problem, with a reported accumulated loss of $3.2 billion in 2007. Due to the immediate monetary rewards from the sensitive information stolen (e.g. user account name and password), financial institutions such as PayPal, eBay, and banks have been the primary brands affected by phishing attacks. The typical communication medium phishing attacks use is email, forged to look like it is from a legitimate organization. The email usually informs the victims of a problem with their account and directs them to take remedial action by entering personal information or logging into their account at a fake website. Although phishing has already spread to other online media, such as instant messaging, where phishing was first reported, and voice-over-IP (also known as vishing), the focus of this study is email-based phishing due to its prevalence and data availability.

Phishing: A social engineering-based spoofing attack whereby a user is sent an email and spoofed into clicking on a link to a bogus website, where personal information such as passwords and account numbers is often stolen. These links can give the impression that the redirect leads to the legitimate site (rather than to a spoofed mock site), although this is not always possible.

Currently, human reviewers maintain some blacklists, such as the one published by PhishTank. With PhishTank, the user community manually verifies potential phishing pages submitted by community members to keep the blacklist mostly error-free. Unfortunately, this review process takes a considerable amount of time, ranging from a median of over ten hours in March 2009 to a median of over fifty hours in June 2009, according to PhishTank's statistics. Omitting verification to improve the timeliness of the data is not a good option for PhishTank. Without verification, the list would have many false positives coming from either innocent confusion or malicious abuse. Existing anti-phishing techniques, whether third-party certification based, password based or URL based, are not robust enough for phishing detection.

The Phishing Phase can be described as a three-step process: Planning, Attack, and Fraud. Each step is described below.

1. Planning. In this stage the attacker determines the victim to attack, the information to be obtained from the victim, and how to obtain this information. Social techniques are employed to gain information about the target victim. Various media, for example phone, instant messaging clients, email, and the Internet, can be used to gain this information.

2. Attack. This phase involves delivery of the phishing message and luring the victim to surrender his/her credentials. Email is a popular method used to deliver the phishing message to the target. The layout of the most typical phishing email is shown in Figure 1. This mail is aimed at gaining financial information from eBay clients.

3. Fraud. The final step of the attacker is fraud. The attacker uses the information obtained in the attack phase to purchase goods, steal money from the victim's account and commit identity theft.

This three-step process does not stop after one attack. It is a continuing process in which the attacker repeats the same steps with another unsuspecting victim.

ISSN: 2231-5381 http://www.internationaljournalssrg.org Page 610

Association rules mining is a technique for finding interesting rules from transactional databases [1]. It was initially used to reveal associations in commercial data drawn from a database of transactions, each representing the set of items purchased by a customer. The association analysis identifies items purchased together. Association rules mining finds correlations between attributes. In our case, we aim to find the correlations between the attributes of the network data. As the correlation differs between attributes of the network data, association rules give the flexibility of determining different relations. The results of the mining are expressed as rules.
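To make the notion of an association rule concrete, the following minimal sketch computes the support of an itemset and the confidence of a rule over a handful of transactions. The transactions are hypothetical toy data, the attribute names are only illustrative, and the helper names `support` and `confidence` are assumptions introduced for this example rather than part of the proposed system.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    antecedent = frozenset(antecedent)
    consequent = frozenset(consequent)
    ante_support = support(antecedent, transactions)
    if ante_support == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / ante_support


# Hypothetical toy transactions: each one is a set of attribute=value items.
transactions = [
    frozenset({"Email Format=HTML", "Apparent Sender=Amazon", "class=YES"}),
    frozenset({"Email Format=HTML", "Apparent Sender=Amazon", "class=YES"}),
    frozenset({"Email Format=XML", "Apparent Sender=Amazon", "class=NO"}),
    frozenset({"Email Format=HTML", "Apparent Sender=ABBY", "class=NO"}),
]

rule_if = {"Email Format=HTML", "Apparent Sender=Amazon"}
rule_then = {"class=YES"}
print(support({"Email Format=HTML"}, transactions))
print(confidence(rule_if, rule_then, transactions))
```

A rule is usually kept only when both its support and its confidence reach user-chosen thresholds; the thresholds themselves are a design choice and are not fixed by this sketch.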
2 BACKGROUND AND RELATED WORK

Gartner [2] estimated the costs at $1,244 per victim, an increase over the $257 they cited in a 2004 report [1]. In 2007, Moore and Clayton estimated the number of phishing victims by examining web server logs. They estimated that 311,449 people fall for phishing scams annually, costing around 350 million dollars [2].

Numerous promising defence approaches to this problem have been reported earlier. One approach is to stop phishing at the email level [3], since the majority of current phishing attacks use broadcast email (spam) to lure victims to a phishing website. Another approach is to use security toolbars. The phishing filter in IE7 [4] is a toolbar approach with features such as blocking the user's activity on a detected phishing site. A third approach is to visually differentiate the phishing sites from the spoofed legitimate sites. Dynamic Security Skins [5] proposes to use a randomly generated visual hash to customize the browser window or web form elements to indicate the successfully authenticated sites. A fourth approach is two-factor authentication, which ensures that the user not only knows a secret but also presents a security token [6]. However, this method is a server-side solution. Phishing can still happen at sites that do not support two-factor authentication. Sensitive information that is not linked to a particular site, e.g., credit card information and SSN, cannot be protected by this approach either [20].

Many industrial anti-phishing products use toolbars in Web browsers, though some researchers have shown that security toolbars do not effectively prevent phishing attacks. Yet another approach is to employ certification, e.g., Microsoft spam privacy [7]. A variant of web credentials is to use a database or list published by a trusted party, where known phishing web sites are blacklisted. The weaknesses of this approach are its poor scalability and its timeliness. The latest version of Microsoft's Internet Explorer supports Extended Validation (EV) certificates, coloring the URL bar green and displaying the name of the company. However, it has been found that EV certificates did not make users less likely to be taken in by phishing attacks [8].

The existing system has several drawbacks. The existing algorithm uses threshold methods in order to classify the phishing results. URLs are not normalized in this system. Email messages and URLs are recognized and treated in the existing system. Data collection is very difficult in this system. The system is overfitted if the data is large. Existing algorithms are not suitable for all types of attributes.

3. PROPOSED FRAMEWORK

Two publicly available datasets were used to test our implementation, including the "phishtank" dataset from phishtank.com [3], which is considered one of the primary phishing-report collections. The PhishTank database records the URL of the suspected website that has been reported, the time of that report, and sometimes further detail such as screenshots of the website, and is publicly available. We used a series of short scripts to programmatically extract the above features and store them in an Excel sheet for quick reference.
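The extraction scripts themselves are not listed in the paper. As a rough sketch of what such a script might look like, the example below parses one PhishTank-style XML record and writes a few derived fields as CSV text that a spreadsheet can open. The enclosing `<entry>` element, the abbreviated sample URL, the function names and the particular features chosen (URL length, IP-address host, a second URL embedded in the query string) are all assumptions made for illustration, not the feature set used in this study.

```python
import csv
import io
import ipaddress
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Abbreviated sample record; the wrapping <entry> element is an assumption.
SAMPLE_RECORD = """<entry>
  <url>http://200.32.240.252/ss2/?https://chaseonline.chase.com/public/login/Logon.aspx</url>
  <phish_id>1542621</phish_id>
  <details><detail>
    <ip_address>200.32.240.252</ip_address>
    <rir>lacnic</rir>
  </detail></details>
  <verification><verified>yes</verified></verification>
  <status><online>yes</online></status>
</entry>"""


def host_is_ip(host):
    """Return True when the host part of the URL is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(entry):
    """Derive a few illustrative (hypothetical) features from one record."""
    url = entry.findtext("url", default="").strip()
    parts = urlsplit(url)
    return {
        "phish_id": entry.findtext("phish_id", default=""),
        "url_length": len(url),
        "host_is_ip": host_is_ip(parts.hostname or ""),
        "embeds_second_url": "://" in parts.query,
        "rir": entry.findtext("details/detail/rir", default=""),
        "verified": entry.findtext("verification/verified", default=""),
    }


def to_csv(rows):
    """Serialise feature dictionaries as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


features = extract_features(ET.fromstring(SAMPLE_RECORD))
print(to_csv([features]))
```

Writing CSV rather than a native spreadsheet format keeps the sketch within the standard library; the resulting file can still be opened in Excel.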
The practical part of this comparative study utilizes four different common DM classification algorithms (C4.5, PART, PRISM and MCAR). Our choice of these methods is based on the different strategies they apply to learning rules from data sets. The PART algorithm was chosen because it combines both approaches, divide-and-conquer and separate-and-conquer, to generate a set of rules. PRISM is a classification rule algorithm that can only deal with nominal attributes and does not do any pruning. Finally, the MCAR algorithm involves two phases: rule generation and classifier building. In the first phase, MCAR scans the training data set to discover frequent single items, and then recursively combines the items generated to produce items involving more attributes. MCAR then generates, ranks and stores the rules. In the second phase, the rules are used to build a classifier by considering their effectiveness on the training data set.

A sample record from the PhishTank data:

<url>http://200.32.240.252/ss2/?https://chaseonline.chase.com/public/login/Logon.aspx?LOB=COLLogon3002776sy266Y25d6t2t7jej66263DSHAhsd27ghG72GSH22s662gsg2</url>
<phish_id>1542621</phish_id>
<phish_detail_url>http://www.phishtank.com/phish_detail.php?phish_id=1542621</phish_detail_url>
<details>
  <detail>
    <ip_address>200.32.240.252</ip_address>
    <cidr_block>200.32.240.0/21</cidr_block>
    <announcing_network>10269</announcing_network>
    <rir>lacnic</rir>
    <detail_time>2012-08-27T05:29:49+00:00</detail_time>
  </detail>
</details>
<submission>
  <submission_time>2012-08-27T05:29:19+00:00</submission_time>
</submission>
<verification>
  <verified>yes</verified>
  <verification_time>2012-08-27T06:38:45+00:00</verification_time>
</verification>
<status>
  <online>yes</online>
</status>

Multiple Minimum Supports Using Maximum Constraints Association rule mining algorithm:

INPUT: A set of n transaction data T, a set of p fields to be detected, each with a predefined minimum support value m_i, i = 1 to p, and a minimum confidence value \lambda.

OUTPUT: A set of association rules under the criterion of the maximum values of minimum supports.

STEP 1: Calculate the count c_k of each item t_k, k = 1 to p, as its occurrence number in the transactions; derive its support value s_{t_k} as:

$$s_{t_k} = \frac{c_k}{n}$$

STEP 2: Check whether the support s_{t_k} of each item t_k is larger than or equal to its predefined minimum support value m_{t_k}. If t_k satisfies the above condition, put it in the set of large 1-itemsets (L_1).

STEP 3: Set r = 1, where r is used to retain the current number of items in an itemset.

STEP 4: Generate the candidate set C_{r+1} from L_r in a way similar to that in the Apriori algorithm, except that the supports of all the large r-itemsets comprising each candidate (r+1)-itemset I_k must be larger than or equal to the maximum (denoted as m_{I_k}) of the minimum supports of the items in these large r-itemsets.

STEP 5: Calculate the count c_{I_k} of each candidate (r+1)-itemset I_k in C_{r+1}, as its occurrence number in the transactions; derive its support value s_{I_k} as:

$$s_{I_k} = \frac{c_{I_k}}{n}$$

STEP 6: Check whether the support s_{I_k} of each candidate (r+1)-itemset I_k is larger than or equal to m_{I_k} (obtained in STEP 4). If I_k satisfies the above condition, put it in the set of large (r+1)-itemsets (L_{r+1}).

STEP 7: IF L_{r+1} is null, do the next step; otherwise, set r = r+1 and repeat STEPs 4 to 7.

STEP 8: Construct the association rules for each large q-itemset I_k with items {I_{k1}, I_{k2}, ..., I_{kq}} by the following substeps:
(a) Form all possible association rules as follows:

$$I_{k1} \wedge \dots \wedge I_{k(j-1)} \wedge I_{k(j+1)} \wedge \dots \wedge I_{kq} \rightarrow I_{kj}, \quad j = 1 \text{ to } q$$

(b) Calculate the confidence values of all association rules using the formula:

$$\frac{s_{I_k}}{s_{I_{k1} \wedge \dots \wedge I_{k(j-1)} \wedge I_{k(j+1)} \wedge \dots \wedge I_{kq}}}$$

STEP 9: Output the rules with confidence values larger than or equal to the predefined confidence value \lambda.

IMPROVED C45:

Attribute Selection: Apply attribute selection to each attribute (L, attribute list) to find the "best" splitting criterion. Gain measures how well a given attribute separates training examples into the targeted classes. The attribute with the highest information gain is selected.

Given a collection S of c outcomes, the expected information needed to classify a tuple in D is given by

$$\mathrm{ModInfo}(D) = -\sum_{i=1}^{m} S_i \log S_i, \quad m \text{ different classes}$$

In the kdd99 dataset we have two class labels, i.e. normal and anomaly. Hence the modified information or entropy is given as

$$\mathrm{ModInfo}(D) = -\sum_{i=1}^{2} S_i \log S_i = -S_1 \log S_1 - S_2 \log S_2$$

where S_1 is the proportion of samples that belong to the target class 'anomaly' and S_2 is the proportion of samples that belong to the target class 'normal'.

The information or entropy of each attribute is calculated using

$$\mathrm{Info}_A(D) = \sum_{i=1}^{v} \frac{|D_i|}{|D|} \times \mathrm{ModInfo}(D_i)$$

The term |D_i|/|D| acts as the weight of the ith partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
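As a worked illustration of the quantities above, the following sketch computes the class entropy of a small data set and the partition-weighted information for one attribute, and takes their difference, which is the information gain used for attribute selection. It assumes base-2 logarithms and uses hypothetical toy rows; the function names `mod_info`, `info_a` and `gain` are chosen here to mirror the notation and do not come from the authors' implementation.

```python
import math
from collections import Counter


def mod_info(labels):
    """Entropy of a list of class labels: -sum(S_i * log2(S_i))."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def info_a(rows, attribute, target):
    """Weighted entropy of the partitions obtained by splitting on `attribute`."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    total = len(rows)
    return sum(len(part) / total * mod_info(part)
               for part in partitions.values())


def gain(rows, attribute, target):
    """Information gain: entropy before the split minus entropy after it."""
    return mod_info([row[target] for row in rows]) - info_a(rows, attribute, target)


# Hypothetical toy rows with two nominal attributes and a YES/NO class label.
rows = [
    {"Email Format": "HTML", "Apparent Sender": "Amazon", "class": "YES"},
    {"Email Format": "HTML", "Apparent Sender": "ABBY", "class": "YES"},
    {"Email Format": "XML", "Apparent Sender": "Amazon", "class": "NO"},
    {"Email Format": "XML", "Apparent Sender": "ABBY", "class": "NO"},
]

for attribute in ("Email Format", "Apparent Sender"):
    print(attribute, gain(rows, attribute, "class"))
```

In this toy data "Email Format" separates the classes perfectly while "Apparent Sender" carries no information, so a gain-based selector would split on "Email Format" first.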
Information gain is defined as the difference between the original information requirement and the new requirement. That is,

$$\mathrm{Gain}(A) = \mathrm{ModInfo}(D) - \mathrm{Info}_A(D)$$

Finding Best Split:

In order to decide which attribute is the best split measure, the correlation coefficient is used as a threshold:

$$r = \frac{\overline{XY} - \bar{X}\,\bar{Y}}{SD_x \cdot SD_y}$$

Let A = MaxGain{AttributeList}
If (r > 0 and A > r) { A is positively alerted and the node is selected. }
Elseif (r < 0 and A > r) { A is negatively alerted and the node is discarded. }
Elseif (r = 0 and A > r) { A is unalerted and the next highest MaxGain is selected. }
Else A is discarded

Depending on the alert type severity, the root node and the child nodes are selected. Recurse on the sublists obtained by splitting on the best attribute, and add those nodes as children of the node.

4 Experimental Results:

All experiments were performed on an Intel(R) Core(TM)2 CPU at 2.13 GHz with 2 GB RAM, running Microsoft Windows XP Professional (SP2).

Improved C45 SPAM tree
Number of Leaves : 3
Size of the tree : 5

Time taken to build model: 0.14 seconds
Time taken to test model on training data: 0.05 seconds

=== Error on training data ===
Correctly Classified Instances      63    100 %
Incorrectly Classified Instances     0      0 %

=== Confusion Matrix ===
  a  b  c   <-- classified as
 21  0  0 | a = Czech Republic
  0 20  0 | b = AZ, US
  0  0 22 | c = US

=== Stratified cross-validation ===
Correctly Classified Instances      62    98.4127 %
Incorrectly Classified Instances     1     1.5873 %

=== Confusion Matrix ===
  a  b  c   <-- classified as
 20  0  1 | a = Czech Republic
  0 20  0 | b = AZ, US
  0  0 22 | c = US

MultiClassClassifier
Classifier 1
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...

Time taken to build model: 0.22 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===
Correctly Classified Instances      21    87.5 %
Incorrectly Classified Instances     3    12.5 %

=== Confusion Matrix ===
  a  b   <-- classified as
 14  2 | a = YES
  1  7 | b = NO

=== Stratified cross-validation ===
Correctly Classified Instances      14    58.3333 %
Incorrectly Classified Instances    10    41.6667 %

=== Confusion Matrix ===
  a  b   <-- classified as
 10  6 | a = YES
  4  4 | b = NO

PART decision list

URL of Web Content: = http://209.61.245.164/board/gp/signin=UTF8&email=&disableCorpSignUp=&path=gp=Fyourstorehome& redirectProtocol=&mode=& useRedirectOnSuccess=1& query=signIn.htm: NO (8.0/3.0)
Apparent Sender = ABBY: YES (12.0/1.0)
Location: = Czech Republic: NO (2.0)
: YES (2.0)

Number of Rules : 4

Time taken to build model: 0.13 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===
Correctly Classified Instances      20    83.3333 %
Incorrectly Classified Instances     4    16.6667 %

=== Confusion Matrix ===
  a  b   <-- classified as
 13  3 | a = YES
  1  7 | b = NO

=== Stratified cross-validation ===
Correctly Classified Instances      15    62.5 %
Incorrectly Classified Instances     9    37.5 %
Kappa statistic                      0.069

=== Confusion Matrix ===
  a  b   <-- classified as
 13  3 | a = YES
  6  2 | b = NO

Prism rules

If URL of Web Content: = http://ordersuk.szm.sk/sing.varzea-loginamazon.htm then YES
If Detailed server information = 56.167.205.59 detailed server information then YES
If Return Address: = account-update1@ amazon.co.uk then YES
If Return Address: = account-update1@ ABBY.co.uk then YES
If URL of Web Content: = http://72.167.25.59/gp/index.php then YES
If Apparent Sender = ABBY and Email Format = XML then YES
If Return Address: = ABBY.co.uk < auto-confirm@ ABBY.co.uk and Apparent Sender = Amazon and Email Format = HTML and URL of Web Content: = http://209.61.245.164/board/gp/signin=UTF8&email=&disableCorpSignUp=&path=gp=Fyourstorehome& redirectProtocol=&mode=& useRedirectOnSuccess=1& query=signIn.htm and Location: = US and Detailed server information = 209.61.245.164 detailed server information then YES
If Return Address: = amazon.co.uk < auto-confirm@ amazon.co.uk and Email Format = HTML and Apparent Sender = Amazon and URL of Web Content: = http://209.61.245.164/board/gp/signin=UTF8&email=&disableCorpSignUp=&path=gp=Fyourstorehome& redirectProtocol=&mode=& useRedirectOnSuccess=1& query=signIn.htm and Location: = US and Detailed server information = 209.61.245.164 detailed server information then YES
If Return Address: = amazon.co.uk < auto-confirm@ amazon.co.uk and Email Format = XML then NO
If Apparent Sender = Amazon and Return Address: = account-update@ ABBY.co.uk then NO
If URL of Web Content: = http://209.61.245.164/board/gp/signin=UTF8&email=&disableCorpSignUp=&path=gp=Fyourstorehome& redirectProtocol=&mode=& useRedirectOnSuccess=1& query=signIn.htm and Email Format = HTML and Apparent Sender = ABBY then NO
If Return Address: = amazon.co.uk < auto-confirm@ amazon.co.uk and Apparent Sender = Amazon and Email Format = HTML and URL of Web Content: = http://209.61.245.164/board/gp/signin=UTF8&email=&disableCorpSignUp=&path=gp=Fyourstorehome& redirectProtocol=&mode=& useRedirectOnSuccess=1& query=signIn.htm and Location: = US and Detailed server information = 209.61.245.164 detailed server information then NO
If Return Address: = ABBY.co.uk < auto-confirm@ ABBY.co.uk and Apparent Sender = Amazon and Email Format = HTML and URL of Web Content: = http://209.61.245.164/board/gp/signin=UTF8&email=&disableCorpSignUp=&path=gp=Fyourstorehome& redirectProtocol=&mode=& useRedirectOnSuccess=1& query=signIn.htm and Location: = US and Detailed server information = 209.61.245.164 detailed server information then NO
If URL of Web Content: = http://72.167.205.59/gp/index.php and Apparent Sender = ABBY and Email Format = HTML then NO

Time taken to build model: 0.04 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===
Correctly Classified Instances      22    91.6667 %
Incorrectly Classified Instances     2     8.3333 %

=== Confusion Matrix ===
  a  b   <-- classified as
 16  0 | a = YES
  2  6 | b = NO

=== Stratified cross-validation ===
Correctly Classified Instances      14    58.3333 %
Incorrectly Classified Instances     6    25 %

=== Confusion Matrix ===
  a  b   <-- classified as
 10  3 | a = YES
  3  4 | b = NO

[Figure: Proposed algorithm performance evaluation. Bar chart of the percentage of correctly and incorrectly classified instances for PRISM, MCAR, Improved C4.5 and PART. Values in the chart's data table: correctly 91.67, 83.33, 87.5, 100; incorrectly 8.333, 16.67, 12.5, 0.]

5. CONCLUSION AND FUTURE WORK

This work mainly identifies several new and generic features for identifying phishing URLs. The proposed improved C4.5 classifier and MCAR achieve very high accuracy. One of the major contributions of this work is a large-scale measurement study conducted on PhishTank web URLs. Experimental results show that phishing and non-phishing instances can be identified based on features of different types. Data mining algorithms require an offline training phase, but the testing phase requires much less time; future work could investigate how well the approach can be adapted to online phishing web classification.
REFERENCES:

[1] R. Dhamija and J.D. Tygar, "The Battle against Phishing: Dynamic Security Skins," Proc. Symp. Usable Privacy and Security, 2005.
[2] G. Sunil Kumar, "Real Time and Offline Network Intrusion Detection using Improved Decision Tree Algorithm," International Journal of Computer Applications (0975 – 8887), Volume 48, No. 25, June 2012.
[3] FDIC, "Putting an End to Account-Hijacking Identity Theft," http://www.fdic.gov/consumers/consumer/idtheftstudy/identity_theft.pdf, 2004.
[4] Maher Aburrous, "Associative Classification Techniques for predicting e-Banking Phishing Websites."
[5] J. R. Quinlan, "Improved use of continuous attributes in C4.5," Journal of Artificial Intelligence Research, 4:77-90, 1996.
[6] Bing Liu, Wynne Hsu and Yiming Ma, "Integrating Classification and Association Rule Mining," Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98, Plenary Presentation), New York, USA, 1998.
[7] T. Fadi, C. Peter and Y. Peng, "MCAR: Multi-class Classification based on Association Rule," IEEE International Conference on Computer Systems and Applications, 2005, pp. 127-133.
[8] WEKA, University of Waikato, New Zealand, "Weka - Data Mining with Open Source Machine Learning Software in Java," 2006.
[9] A. Y. Fu, L. Wenyin and X. Deng, "Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover's Distance (EMD)," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 4, 2006.
[10] B. Adida, S. Hohenberger and R. Rivest, "Lightweight Encryption for Email," USENIX Steps to Reducing Unwanted Traffic on the Internet (SRUTI), 2005.
[11] T. Sharif, "Phishing Filter in IE7," http://blogs.msdn.com/ie/archive/2005/09/09/463204.aspx, 2006.
[12] J. Cendrowska, "PRISM: An algorithm for inducing modular rule," International Journal of Man-Machine Studies (1987), Vol. 27, No. 4, pp. 349-370.