International Journal of Engineering Trends and Technology (IJETT) – Volume 34 Number 6- April 2016
Abstract — In today's technical world, people are connected to one another through various communication media, one of which is the Internet. The Internet is used to communicate with friends and family, and most business transactions are now also carried out over it. People use the Internet to pay bills, shop online, and so on, and they keep information such as addresses, credit card numbers, and telephone numbers saved for these transactions. Many hackers across the world exploit these services to carry out cyber-attacks. One such attack is phishing. In a phishing attack, a user receives emails from attackers that appear to come from legitimate organizations and that ask the user to carry out a transaction. A user who is unaware of such attacks carries out the transaction and becomes a victim. Such attacks have become common, and users across the world have lost huge sums of money.
Researchers across the world have therefore proposed many anti-phishing techniques. This paper analyses a phishing database record to understand the phishing patterns for a website. Based on this analysis, I use an SVM-based classifier, a Naïve Bayes classifier, and a Random Forest-based classifier to determine the best classifier for anti-phishing methods. To do so, I use Weka, R script, and data mining techniques.
Keywords — Phishing, data mining, R Script, WEKA.
Master of Technology Scholar, Department of CSE, SITM, Lucknow
I. INTRODUCTION
Phishing is a kind of website spoofing that is used to steal sensitive and important information from web users, such as online banking passwords, credit card details, and account passwords. In a phishing attack, the attacker sends the user a warning message about security issues, requests confidential information through phishing emails, asks the user to update account information, and so on. Several experimental designs have been proposed in earlier studies to counter phishing attacks. The earlier systems do not deliver more than 90 percent effective results, and in some cases a tool is only 50-60 percent effective [4]. On the basis of the performance of these earlier anti-phishing systems, some industries have deployed them to protect their organizations from phishing attacks.
In a phishing attack, the attacker creates web pages that look like copies of pages on legitimate websites. Of the proposed anti-phishing techniques, some rely on machine learning or data mining algorithms. Although web browsers (e.g., Mozilla Firefox, Internet Explorer, Opera, and Google Chrome) provide add-on tools for blocking phishing emails and phishing sites, phishers still manage to bypass these security mechanisms. Phishing has become so sophisticated that phishers can evade the filters set by current anti-phishing techniques [1]. The rapidly increasing number of phishing attacks suggests that it is hard to find a single logical approach for detecting phishing emails and that current anti-phishing tools are not sufficient. This may be attributed to the mostly passive nature of anti-phishing techniques: they do not stop the source of phishing emails but merely classify and detect them.
Data mining algorithms are applied to the set of candidate phishing features that can be extracted from a website. Given these features, the phishing problem can be addressed by selecting an appropriate feature set. In this study, different data mining algorithms are applied to a data set of phishing and legitimate websites to compare the accuracy of the classifiers.
II. RELATED WORK
There are various existing phishing detection approaches. They can be classified as 1) content-based approaches, which use site content to detect phishing; 2) non-content-based approaches, which do not use the content of the site to determine whether an email is authentic or phishing; and 3) visual-based approaches, which detect phishing by visually comparing a site with known sites [2].
ISSN: 2231-5381 http://www.ijettjournal.org
A. Content Based Approach
In the content-based approach, phishing attacks are detected by examining site content. Features used in this approach include keywords, spelling errors, links, password fields, embedded links, and so on, along with URL and host-based features [5]. Google's anti-phishing filter detects phishing and malware by examining the page URL, page rank, WHOIS information, and the contents of a page, including HTML, JavaScript, images, iframes, and so on [5]. The classifier is continually updated to accommodate new phishing sites and keep up with the latest phishing techniques. In this approach the classifier may have higher accuracy, but the result is not produced in real time; it is used offline because it takes more time to detect phishing [7]. Several researchers have explored other approaches to detecting phishing sites, such as fingerprinting, principal component analysis of images, heuristic approaches, and fuzzy logic. One such approach uses fuzzy logic linguistic descriptors with a range of values for each identified phishing characteristic, specifically spelling errors, keywords, and embedded links. The membership function derived for each characteristic is used to evaluate the probability that an email is a phishing email [2].
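As an illustration of the membership-function idea described above, the following Python sketch maps a raw count for each characteristic to a degree between 0 and 1 and combines the degrees with a fuzzy OR. It is only a sketch under stated assumptions: the ramp shape, the value ranges, and the names ramp_membership, RANGES, and phishing_degree are hypothetical and are not taken from the cited work.

```python
def ramp_membership(value, low, high):
    """Degree (0.0 to 1.0) to which `value` counts as 'high' on a linear ramp."""
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


# Hypothetical (low, high) ranges per characteristic; assumptions for illustration.
RANGES = {
    "spelling_errors": (0, 5),
    "suspicious_keywords": (0, 4),
    "embedded_links": (0, 3),
}


def phishing_degree(counts):
    """Combine the per-characteristic memberships with a fuzzy OR (maximum)."""
    memberships = [
        ramp_membership(counts.get(name, 0), low, high)
        for name, (low, high) in RANGES.items()
    ]
    return max(memberships)


if __name__ == "__main__":
    example = {"spelling_errors": 1, "suspicious_keywords": 2, "embedded_links": 0}
    print(phishing_degree(example))
```

The maximum is one common choice for a fuzzy OR; a rule base with several linguistic levels per characteristic would be closer to a full fuzzy system, and the resulting degree is not a calibrated probability.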
B. Non-Content Based Approach
Non-content-based approaches rely primarily on URL and host information classification. URLs are typically classified on the basis of features such as URL length and the presence of special characters. In addition, host features of the URL, such as IP address, site owner, DNS properties, and geographical properties, are also used in the classification of phishing emails [5]. The success rate is between 95% and 99%, even in real-time processing [6].
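The URL-based features mentioned above can be extracted without fetching the page. The Python sketch below uses only the standard library; the feature names, the choice of special characters, and the function url_features are assumptions made for illustration and do not reproduce the feature extraction of any cited system.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical selection of characters treated as "special" in a URL.
SPECIAL_CHARS = "@-_~%"


def url_features(url):
    """Return a few lexical and host-based features of a URL string."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "host_is_ip": host_is_ip,
        "has_at_symbol": "@" in url,
        "special_char_count": sum(url.count(c) for c in SPECIAL_CHARS),
        # Dots in a host name give a rough count of sub-domain levels.
        "host_dots": 0 if host_is_ip else host.count("."),
    }


if __name__ == "__main__":
    for sample in ("http://192.168.0.1/login@secure", "https://www.example.com/a-b"):
        print(sample, url_features(sample))
```

Host properties such as site owner, DNS records, and geographical location need external lookups (for example WHOIS or DNS queries) and are left out of this sketch.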
III. METHODOLOGY
When a web user accesses a site, the user either types the web address into the URL bar or arrives at the target page through a link on another site. In this case, the URL should be checked first, and then the page content and images should be checked [8]. Checking these aspects of a site takes considerable time, because the site information must be compared with the information stored in the database of the web browser's add-on. Earlier studies proposed browser-based client-side solutions to mitigate phishing attacks [2]. Some techniques have also been developed that try to prevent phishing emails from being delivered [3]. We therefore need a system that can immediately check the information entered by the user against the system's database while the user enters confidential information into the site. To make such a fast system, we have defined study points for the best possible solution.
The criteria considered for phishing have been collected from the previous study [5]. The study points are as follows:
TABLE I
ATTRIBUTE AND COLUMN NAME OF PHISHING WEBSITE DATASET
Attribute                     Values        Column Name
Having IP Address             { 1,0 }       has_ip
Having long url               { 1,0,-1 }    long_url
Uses Shortening Service       { 0,1 }       short_service
Having '@' Symbol             { 0,1 }       has_at
Double slash redirecting      { 0,1 }       double_slash_redirect
Having Prefix Suffix          { -1,0,1 }    pref_suf
Having Sub Domain             { -1,0,1 }    has_sub_domain
SSLfinal State                { -1,1,0 }    ssl_state
Domain registration length    { 0,1,-1 }    long_domain
Favicon                       { 0,1 }       Favicon
Is standard Port              { 0,1 }       Port
Uses HTTPS token              { 0,1 }       https_token
Request_URL                   { 1,-1 }      req_url
Abnormal URL anchor           { -1,0,1 }    url_of_anchor
Links_in_tags                 { 1,-1,0 }    tag_links
SFH                           { -1,1 }      SFH
Submitting to email           { 1,0 }       submit_to_email
Abnormal URL                  { 1,0 }       abnormal_url
Redirect                      { 0,1 }       Redirect
on mouseover                  { 0,1 }       Mouseover
Right Click                   { 0,1 }       right_click
popUp Window                  { 0,1 }       Popup
Iframe                        { 0,1 }       Iframe
Age of domain                 { -1,0,1 }    domain_age
DNS Record                    { 1,0 }       dns_record
Web traffic                   { -1,0,1 }    Traffic
Page Rank                     { -1,0,1 }    page_rank
Google Index                  { 0,1 }       google_index
Links pointing to page        { 1,0,-1 }    links_to_page
Statistical report            { 1,0 }       stats_report
Result                        { 1,-1 }      Target
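The value sets in Table I can serve as a simple schema for checking a record before it is passed to a classifier. The Python sketch below covers only a subset of the columns for brevity; the column names and value sets are those listed in the table, while the dictionary ALLOWED_VALUES and the function validate_record are hypothetical helpers and not part of the R script or WEKA workflow used in this work.

```python
# Subset of Table I: column name -> allowed values.
ALLOWED_VALUES = {
    "has_ip": {1, 0},
    "long_url": {1, 0, -1},
    "short_service": {0, 1},
    "has_at": {0, 1},
    "https_token": {0, 1},
    "google_index": {0, 1},
    "stats_report": {1, 0},
    "Target": {1, -1},
}


def validate_record(record):
    """Return a list of problems found in one record (empty if it is valid)."""
    problems = []
    for column, allowed in ALLOWED_VALUES.items():
        if column not in record:
            problems.append("missing column: " + column)
        elif record[column] not in allowed:
            problems.append(
                "{}={!r} not in {}".format(column, record[column], sorted(allowed))
            )
    return problems


if __name__ == "__main__":
    row = {"has_ip": 0, "long_url": -1, "short_service": 0, "has_at": 0,
           "https_token": 1, "google_index": 1, "stats_report": 0, "Target": 1}
    print(validate_record(row))
```

Extending the dictionary with the remaining columns of Table I gives a complete check for one row of the dataset.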
Fig. 1 Steps for validating the classifiers
IV. RESULT
The given data set is evaluated to check the performance of the various classifiers. Evaluating the proposed methodology using WEKA and R script yields the following results.
A. Naïve Bayes classifier output
Time taken to build model: 0.02 seconds
Stratified cross-validation
Correctly Classified Instances 2281 92.8746 %
Incorrectly Classified Instances 175 7.1254 %
Kappa statistic 0.8566
Mean absolute error 0.0843
Root mean squared error 0.2434
Relative absolute error 17.0559 %
Root relative squared error 48.9701 %
Coverage of cases (0.95 level) 96.9055 %
Mean rel. region size (0.95 level) 57.1254 %
Total Number of Instances 2456
Detailed Accuracy by Class
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.949    0.087    0.897      0.949   0.922      0.975     1
               0.913    0.051    0.957      0.913   0.934      0.975     -1
Weighted Avg.  0.929    0.067    0.93       0.929   0.929      0.975
Confusion Matrix
    a     b   <-- classified as
 1038    56 | a = 1
  119  1243 | b = -1
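The per-class figures and the Kappa statistic reported above can be recomputed from the confusion matrix alone. The Python sketch below does so, taking the second matrix row as 119 and 1243, the counts implied by the reported totals (2281 correctly classified out of 2456 instances). The function names per_class_metrics and kappa are illustrative and are not WEKA API calls.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] = instances of actual class labels[i] classified as labels[j]."""
    total = sum(sum(row) for row in matrix)
    results = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        actual = sum(matrix[k])                     # row total
        predicted = sum(row[k] for row in matrix)   # column total
        recall = tp / actual                        # TP rate
        precision = tp / predicted
        results[label] = {
            "tp_rate": recall,
            "fp_rate": (predicted - tp) / (total - actual),
            "precision": precision,
            "f_measure": 2 * precision * recall / (precision + recall),
        }
    return results


def kappa(matrix):
    """Cohen's kappa: agreement beyond what the row/column totals give by chance."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[k][k] for k in range(n)) / total
    expected = sum(
        sum(matrix[k]) * sum(row[k] for row in matrix) for k in range(n)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    naive_bayes = [[1038, 56], [119, 1243]]
    for label, m in per_class_metrics(naive_bayes, [1, -1]).items():
        print(label, {name: round(value, 3) for name, value in m.items()})
    print("kappa", round(kappa(naive_bayes), 4))
```

The ROC Area column cannot be recovered this way, since it depends on the predicted class probabilities rather than on the final class assignments.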
B. SVM radial kernel based Classifier
Confusion Matrix and Statistics
          Reference
Prediction  -1    1
        -1 332   13
        1    8  260
Accuracy: 0.9657
95% CI: (0.9481, 0.9787)
No Information Rate: 0.5546
P-Value [Acc > NIR]: <2e-16
Kappa: 0.9305
Mcnemar's Test P-Value: 0.3827
Sensitivity: 0.9765
Specificity: 0.9524
Pos Pred Value: 0.9623
Neg Pred Value: 0.9701
Prevalence: 0.5546
Detection Rate: 0.5416
Detection Prevalence: 0.5628
Balanced Accuracy: 0.9644
'Positive' Class: -1
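The statistics in this output follow from the four cells of the confusion matrix. The Python sketch below recomputes them for the SVM result, reading the second matrix row as 8 and 260 and taking -1 as the positive class, as the output states. The function binary_stats and its argument names are illustrative and are not part of the R script used in this work.

```python
def binary_stats(tp, fn, fp, tn):
    """Summary statistics of a 2x2 confusion matrix for a chosen positive class.

    tp: predicted positive, actually positive    fp: predicted positive, actually negative
    fn: predicted negative, actually positive    tn: predicted negative, actually negative
    """
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "detection_rate": tp / total,
        "detection_prevalence": (tp + fp) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


if __name__ == "__main__":
    # Rows of the matrix are predictions, columns are reference labels.
    svm = binary_stats(tp=332, fp=13, fn=8, tn=260)
    for name, value in svm.items():
        print(name, round(value, 4))
```

Because the rows are predictions and the columns are reference labels, the 13 sites predicted as -1 but labelled 1 are the false positives here, and the 8 sites predicted as 1 but labelled -1 are the false negatives.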
C. Random forest based Classifier
Confusion Matrix and Statistics
          Reference
Prediction  -1    1
        -1 331   10
        1    9  263
Accuracy: 0.969
95% CI: (0.952, 0.9812)
No Information Rate: 0.5546
P-Value [Acc > NIR]: <2e-16
Kappa: 0.9372
Mcnemar's Test P-Value: 1
Sensitivity: 0.9735
Specificity: 0.9634
Pos Pred Value: 0.9707
Neg Pred Value: 0.9669
Prevalence: 0.5546
Detection Rate: 0.5400
Detection Prevalence: 0.5563
Balanced Accuracy: 0.9684
'Positive' Class: -1
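The Mcnemar's Test P-Value lines in the SVM and Random Forest outputs depend only on the two off-diagonal cells of each confusion matrix. The Python sketch below assumes the continuity-corrected form of McNemar's chi-squared test with one degree of freedom, which is consistent with the reported values; the function name mcnemar_p_value is illustrative.

```python
import math


def mcnemar_p_value(b, c):
    """P-value of McNemar's test from the two discordant cell counts b and c."""
    if b == c:
        # Equal discordant counts give a statistic of zero (an assumption also
        # used here for the degenerate case b = c = 0).
        return 1.0
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of the chi-squared distribution with 1 degree of freedom:
    # P(X > s) = erfc(sqrt(s / 2)).
    return math.erfc(math.sqrt(statistic / 2))


if __name__ == "__main__":
    print("SVM          ", round(mcnemar_p_value(13, 8), 4))
    print("Random forest", round(mcnemar_p_value(10, 9), 4))
```

A large p-value means the two kinds of error (phishing sites passed as legitimate and legitimate sites flagged as phishing) occur at statistically similar rates, which is the case for both classifiers.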
Fig. 2 Overall statistics of the given classifiers
V. CONCLUSIONS
In a phishing attack, the user sends confidential information to a mimic website, so the user should be informed immediately about the category of the website. The aim of this research work is to predict whether a given URL belongs to a phishing website. This work takes its dataset from the UCI machine learning repository, creates an R script, and uses the WEKA interface to evaluate various types of classifier on the dataset.
The results indicate that the SVM-based classifier and the Random Forest-based classifier are the best classifiers and can be employed in building an anti-phishing built-in tool or function for any web browser.
REFERENCES
[1] Ahmed Abbasi, Fatemeh “Mariam” Zahedi and Yan Chen, “Impact of Anti-Phishing Tool Performance on Attack Success Rates”, 10th IEEE International Conference on Intelligence and Security Informatics (ISI), Washington, D.C., USA, June 11-14, 2012.
[2] A. Abbasi and H. Chen, “A Comparison of Fraud Cues and Classification Methods for Fake Escrow Website Detection,” Information Technology and Management, Vol. 10(2), pp. 83-101, 2009.
[3] G. Bansal, F. M. Zahedi, and D. Gefen, “The Impact of Personal Dispositions on Information Sensitivity, Privacy Concern and Trust in Disclosing Health Information Online,” Decision Support Systems, Vol. 49(2), pp. 138-150, 2010.
[4] Y. Chen, F. M. Zahedi, and A. Abbasi, “Interface Design Elements for Anti-phishing Systems,” In Proc. Intl. Conf. Design Science Research in Information Systems and Technology, pp. 253-265, 2011.
[5] S. Grazioli and S. L. Jarvenpaa, “Perils of Internet Fraud: An Empirical Investigation of Deception and Trust with Experienced Internet Consumers,” IEEE Trans. Systems, Man, and Cybernetics Part A, vol. 20(4), pp. 395-410, 2000.
[6] APWG 2nd Quarter 2014 Phishing Activity Trends Report, from www.antiphishing.org
[7] Javelin Strategy and Research. http://www.javelinstrategy.com, 2012.
[8] Rosana J. Ferolin, “A Proactive Anti-Phishing Tool Using Fuzzy Logic and RIPPER Data Mining Classification Algorithm”, pp. 292-304, 2012.