
International Journal of Engineering Trends and Technology (IJETT) – Volume 34 Number 6- April 2016

Comparison of Classifier for Anti-Phishing Techniques

Pradeep Tiwari #1, Ravendra Ratan Singh 2

Abstract — In today’s technical world, each of us is connected with others through various communication media, one of which is the internet. The internet is used to communicate with friends and family, and these days most business transactions are also executed over it. People use the internet for paying bills, shopping online, etc., and keep information such as addresses, credit card numbers and telephone numbers saved for transactions. Many hackers across the world exploit these internet services to carry out many kinds of cyber-attack. One such attack is the phishing attack. In a phishing attack, a user receives emails from attackers that appear to come from legitimate organizations and that ask the user to carry out a transaction. A user who is unaware of such attacks carries out the transaction and becomes a victim of the phishing attack. Such attacks have become common, and huge sums of money have been lost by users across the world.

Thus, researchers across the world have come up with many anti-phishing techniques. This paper analyses a phishing database record to understand the phishing patterns for a website. Based on my analysis, I use an SVM based classifier, a Naïve Bayes classifier and a Random forest based classifier to determine the best classifier for anti-phishing methods. To do so, I make use of Weka, R script and data mining techniques.

Keywords — Phishing, data mining, R Script, WEKA.

# Master of Technology Scholar, Department of CSE, SITM, Lucknow

I. INTRODUCTION

Phishing is a form of website spoofing used to steal sensitive and important information from web users, such as online banking passwords, credit card information and account passwords. In a phishing attack, the attacker sends the user a warning message about security issues, requests confidential information through phishing emails, asks the user to update account information, and so on. Several experimental design considerations have been proposed in earlier studies to counter the phishing attack. The earlier systems do not give results that are more than 90 percent effective; in some cases the tool gives only 50-60 percent effective results [4].

On the basis of the performance of earlier anti-phishing systems, some industries have applied these systems to protect their organizations from phishing attacks.

In a phishing attack, the attacker creates web pages that look like copies of the pages of legitimate websites.

Of the proposed anti-phishing techniques, some are based on machine learning or data mining algorithms.

Although web browsers (e.g. Mozilla Firefox, Internet Explorer, Opera, Google Chrome) provide add-on tools for blocking phishing emails and phishing sites, phishers still manage to override these security mechanisms. Phishing has become so sophisticated that phishers can bypass the filters set by current anti-phishing techniques [1]. The rapidly increasing number of phishing attacks suggests that it is hard to find a single coherent approach to detecting phishing emails and that current anti-phishing tools are not sufficient. This may be attributed to the mostly passive approach of anti-phishing techniques: they do not stop the source of the phishing emails but merely classify and detect them.

The data mining algorithms are applied to the set of possible phishing features that can be extracted from a website. Given these features, the phishing problem can be addressed by selecting the right set of features. In this study of anti-phishing, different data mining algorithms have been applied to a data set of phishing and legitimate websites to compare the accuracy of the classifiers.

II. RELATED WORK

There are several existing phishing detection approaches. They can be classified as 1) content-based approaches, which use site contents to detect phishing, 2) non-content-based approaches, which do not use the contents of the site to determine whether an email is authentic or phishing, and 3) visual-based approaches, which detect phishing through the visual similarity of a page to known sites [2].

ISSN: 2231-5381 http://www.ijettjournal.org


A. Content Based Approach

In the content based approach, phishing attacks are detected by examining site contents. Features used in this approach include keywords, spelling errors, links, password fields, embedded links, etc., along with URL and host based features [5].

Google's anti-phishing filter detects phishing and malware by examining page URL, page rank, WHOIS information and the contents of a page, including HTML, JavaScript, images, iframes, etc. [5]. The classifier is constantly updated to accommodate new phishing sites and to keep up with the latest techniques in phishing attacks. In this approach the classifier may have higher accuracy, but the result is not real-time: it is used offline because it takes more time to detect the phishing [7]. Several researchers have explored different approaches to detecting phishing sites, such as fingerprinting, principal component analysis of images, heuristic approaches and fuzzy logic based approaches, among others. Our approach uses fuzzy logic linguistic descriptors with a range of values for each identified phishing characteristic, particularly spelling errors, keywords and embedded links. The membership function derived for each characteristic is used to estimate the likelihood that the email is a phishing email [2].

B. Non-Content Based Approach

Non-content based approaches are primarily based on classification of URL and host information. URLs are commonly classified based on features such as URL length and the presence of special characters. In addition, host features of the URL, such as IP address, site owner, DNS properties and geographical properties, are also used in the classification of phishing emails [5]. The success rate is between 95% and 99%, even in real-time processing [6].
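The URL-level features mentioned here (address length, special characters, an IP address in place of a host name) can be read directly off the URL string. The following is a minimal sketch in Python, not the extraction code behind the dataset used in this paper; the function name `url_features` and the returned keys are hypothetical, and no thresholds are applied because the paper does not state any.

```python
import ipaddress
import re
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Extract a few lexical URL features (illustrative, not the paper's code)."""
    host = urlparse(url).hostname or ""
    try:
        # ip_address() raises ValueError unless the host is a literal IP.
        ipaddress.ip_address(host)
        has_ip = True
    except ValueError:
        has_ip = False
    return {
        "url_length": len(url),
        "has_ip": has_ip,
        "has_at": "@" in url,
        # Count every character that is not a letter or a digit.
        "special_chars": len(re.findall(r"[^A-Za-z0-9]", url)),
        # Dots beyond the one separating domain and top-level domain.
        "subdomain_dots": 0 if has_ip else max(host.count(".") - 1, 0),
    }


example = url_features("http://192.168.0.1/login@bank")
```

A classifier would consume such values only after they are mapped to the coded levels used by the dataset, a step this sketch leaves out.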

III. METHODOLOGY

When a web user accesses a website, the user either types the web address into the URL bar or arrives at the target page through a reference link from another site. In this case, the URL should be checked first, and then the page contents and the images it contains [8]. Checking these various aspects of the site takes considerable time, because the site information has to be matched against the information stored in the database of the web browser's add-on. In earlier studies, browser based client-side solutions have been proposed to mitigate phishing attacks [2]. Several techniques have also been developed that attempt to prevent phishing emails from being delivered [3]. We therefore need a system that can immediately check the data entered by the user against the information in the system's database while the user is entering confidential information on the site. To make such a fast system, we have defined the study points for the best possible solution.

The criteria considered for phishing have been collected from a previous study [5]. The study points are as follows:

TABLE I
ATTRIBUTE AND COLUMN NAME OF PHISHING WEBSITE DATASET

Attribute                      Values        Column Name
Having IP Address              { 1,0 }       has_ip
Having long url                { 1,0,-1 }    long_url
Uses Shortening Service        { 0,1 }       short_service
Having '@' Symbol              { 0,1 }       has_at
Double slash redirecting       { 0,1 }       double_slash_redirect
Having Prefix Suffix           { -1,0,1 }    pref_suf
Having Sub Domain              { -1,0,1 }    has_sub_domain
SSLfinal State                 { -1,1,0 }    ssl_state
Domain registration length     { 0,1,-1 }    long_domain
Favicon                        { 0,1 }       Favicon
Is standard Port               { 0,1 }       Port
Uses HTTPS token               { 0,1 }       https_token
Request_URL                    { 1,-1 }      req_url
Abnormal URL anchor            { -1,0,1 }    url_of_anchor
Links_in_tags                  { 1,-1,0 }    tag_links
SFH                            { -1,1 }      SFH
Submitting to email            { 1,0 }       submit_to_email
Abnormal URL                   { 1,0 }       abnormal_url
Redirect                       { 0,1 }       Redirect
on mouseover                   { 0,1 }       Mouseover
Right Click                    { 0,1 }       right_click
popUp Window                   { 0,1 }       Popup
Iframe                         { 0,1 }       Iframe
Age of domain                  { -1,0,1 }    domain_age
DNS Record                     { 1,0 }       dns_record
Web traffic                    { -1,0,1 }    Traffic
Page Rank                      { -1,0,1 }    page_rank
Google Index                   { 0,1 }       google_index
Links pointing to page         { 1,0,-1 }    links_to_page
Statistical report             { 1,0 }       stats_report
Result                         { 1,-1 }      Target
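The value sets in Table I can be treated as a schema for checking records before they are passed to a classifier. The sketch below is illustrative only: it copies a handful of columns from the table rather than all of them, and the names `ALLOWED_VALUES` and `invalid_columns` are hypothetical, not part of the paper's R script or of WEKA.

```python
# Allowed coded values for a few of the columns listed in Table I.
ALLOWED_VALUES = {
    "has_ip": {1, 0},
    "long_url": {1, 0, -1},
    "has_at": {0, 1},
    "ssl_state": {-1, 1, 0},
    "domain_age": {-1, 0, 1},
    "Target": {1, -1},
}


def invalid_columns(record: dict) -> list:
    """Return sorted names of columns that are missing or hold a disallowed value."""
    return sorted(
        name
        for name, allowed in ALLOWED_VALUES.items()
        if record.get(name) not in allowed  # a missing key yields None
    )


good = {"has_ip": 1, "long_url": -1, "has_at": 0,
        "ssl_state": 1, "domain_age": 0, "Target": -1}
bad = {"has_ip": 1, "long_url": -1, "has_at": 2,
       "ssl_state": 1, "domain_age": 0}
```

Here `invalid_columns(good)` is empty, while `invalid_columns(bad)` reports the out-of-range `has_at` value and the missing `Target` label.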

Fig. 1 Steps for validating the classifiers

IV. RESULT

The given data set is validated to check the performance of the various classifiers. Evaluating the proposed methodology using WEKA and R script gives the following results.

A. Naïve Bayes classifier output

Time taken to build model: 0.02 seconds

Stratified cross-validation

Correctly Classified Instances 2281 92.8746 %

Incorrectly Classified Instances 175 7.1254 %

Kappa statistic 0.8566

Mean absolute error 0.0843

Root mean squared error 0.2434

Relative absolute error 17.0559 %

Root relative squared error 48.9701 %

Coverage of cases (0.95 level) 96.9055 %

Mean rel. region size (0.95 level) 57.1254 %

Total Number of Instances 2456

Detailed Accuracy by Class

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.949    0.087    0.897      0.949   0.922      0.975     1
0.913    0.051    0.957      0.913   0.934      0.975     -1
0.929    0.067    0.93       0.929   0.929      0.975     Weighted Avg.

Confusion Matrix

   a     b   <-- classified as
1038    56 | a = 1
 119  1243 | b = -1
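For readers unfamiliar with the classifier evaluated in this subsection, a naive Bayes model over coded attributes such as those in Table I multiplies a class prior by one per-attribute likelihood and picks the class with the larger product. The sketch below is a minimal categorical version with Laplace smoothing, written in Python on made-up toy rows. It is not WEKA's implementation, and the smoothing constant `alpha` and both function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels, alpha=1.0):
    """Count class and per-feature value frequencies (alpha = Laplace smoothing)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, label) -> value counts
    feature_values = defaultdict(set)     # feature index -> values seen in training
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values, alpha


def predict(model, row):
    """Return the label with the highest posterior log-probability."""
    class_counts, value_counts, feature_values, alpha = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)   # log prior
        for i, value in enumerate(row):
            numerator = value_counts[(i, label)][value] + alpha
            denominator = count + alpha * len(feature_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: two coded attributes per row, labels in {1, -1}.
toy_rows = [(1, 1), (1, 0), (-1, -1), (-1, 0)]
toy_labels = [1, 1, -1, -1]
model = train_naive_bayes(toy_rows, toy_labels)
```

With this toy model, `predict(model, (1, 1))` returns 1 and `predict(model, (-1, -1))` returns -1.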

B. SVM radial kernel based classifier

Confusion Matrix and Statistics

            Reference
Prediction   -1     1
        -1  332    13
         1    8   260

Accuracy : 0.9657
95% CI : (0.9481, 0.9787)
No Information Rate : 0.5546
P-Value [Acc > NIR] : <2e-16

Kappa : 0.9305
Mcnemar's Test P-Value : 0.3827

Sensitivity : 0.9765
Specificity : 0.9524
Pos Pred Value : 0.9623
Neg Pred Value : 0.9701
Prevalence : 0.5546
Detection Rate : 0.5416
Detection Prevalence : 0.5628
Balanced Accuracy : 0.9644

'Positive' Class : -1
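The radial (RBF) kernel named in this subsection scores the similarity of two coded records x and y as exp(-gamma * ||x - y||^2). The following Python sketch shows only that kernel function. The paper does not report the kernel parameter or the R package used, so the `gamma` values below are purely illustrative.

```python
import math


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical records give similarity 1; similarity decays as records differ.
same = rbf_kernel([1, 0, -1], [1, 0, -1], gamma=0.5)
different = rbf_kernel([1, 0], [-1, 0], gamma=0.25)
```

A larger `gamma` makes the similarity fall off faster with distance, which generally produces a more tightly fitted decision boundary.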

C. Random forest based classifier

Confusion Matrix and Statistics

            Reference
Prediction   -1     1
        -1  331    10
         1    9   263

Accuracy : 0.969
95% CI : (0.952, 0.9812)
No Information Rate : 0.5546
P-Value [Acc > NIR] : <2e-16

Kappa : 0.9372
Mcnemar's Test P-Value : 1

Sensitivity : 0.9735
Specificity : 0.9634
Pos Pred Value : 0.9707
Neg Pred Value : 0.9669
Prevalence : 0.5546
Detection Rate : 0.5400
Detection Prevalence : 0.5563
Balanced Accuracy : 0.9684

'Positive' Class : -1
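Most of the statistics in this listing follow directly from the four cells of the confusion matrix. As a check, the Python sketch below recomputes several of them from the random forest matrix, taking -1 as the positive class as the listing states. The function name `matrix_stats` is hypothetical; the formulas are the standard definitions of sensitivity, specificity, predictive values and Cohen's kappa.

```python
def matrix_stats(tp, fp, fn, tn):
    """Statistics for a 2x2 confusion matrix.

    tp and fp are the counts in the positive-prediction row,
    fn and tn the counts in the negative-prediction row.
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (observed - expected) / (1 - expected),
    }


# Random forest matrix: predicted -1 row is (331, 10), predicted 1 row is (9, 263).
stats = matrix_stats(tp=331, fp=10, fn=9, tn=263)
```

Rounded to four decimals, these values agree with the sensitivity, specificity, predictive values, balanced accuracy and kappa reported for the random forest classifier.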


Fig. 2 Overall statistics of the given classifiers

V. CONCLUSIONS

In a phishing attack, users send their confidential information to imitation websites, so the user should be informed immediately about the category of a website. The aim of this research work is to predict whether a given URL belongs to a phishing website or not. This work takes its dataset from the UCI machine learning repository, creates an R script and uses the WEKA interface to evaluate various types of classifier on the given dataset.

The results show that the SVM based classifier and the Random forest based classifier are the best classifiers, and either can be employed in building an anti-phishing tool or function built into any web browser.

REFERENCES

[1] Ahmed Abbasi, Fatemeh “Mariam” Zahedi and Yan Chen, “Impact of Anti-Phishing Tool Performance on Attack Success Rates,” 10th IEEE International Conference on Intelligence and Security Informatics (ISI), Washington, D.C., USA, June 11-14, 2012.

[2] A. Abbasi and H. Chen, “A Comparison of Fraud Cues and Classification Methods for Fake Escrow Website Detection,” Information Technology and Management, Vol. 10(2), pp. 83-101, 2009.

[3] G. Bansal, F. M. Zahedi, and D. Gefen, “The Impact of Personal Dispositions on Information Sensitivity, Privacy Concern and Trust in Disclosing Health Information Online,” Decision Support Systems, Vol. 49(2), pp. 138-150, 2010.

[4] Y. Chen, F. M. Zahedi, and A. Abbasi, “Interface Design Elements for Anti-phishing Systems,” In Proc. Intl. Conf. Design Science Research in Information Systems and Technology, pp. 253-265, 2011.

[5] S. Grazioli and S. L. Jarvenpaa, “Perils of Internet Fraud: An Empirical Investigation of Deception and Trust with Experienced Internet Consumers,” IEEE Trans. Systems, Man, and Cybernetics Part A, vol. 20(4), pp. 395-410, 2000.

[6] APWG, 2nd Quarter 2014 Phishing Activity Trends Report, from www.antiphishing.org

[7] Javelin Strategy and Research. http://www.javelinstrategy.com, 2012.

[8] Rosana J. Ferolin, “A Proactive Anti-Phishing Tool Using Fuzzy Logic and RIPPER Data Mining Classification Algorithm,” pp. 292-304, 2012.
