Adaptive Classifier and Associative Algorithms for phishing detection

International Journal of Engineering Trends and Technology- Volume3Issue5- 2012
Ch. Sonika #1, Mrs. D. Raaga Vamsi #2
#1 M.Tech (CSE), Gudlavalleru Engineering College, Gudlavalleru
#2 Professor, Gudlavalleru Engineering College, Gudlavalleru
ABSTRACT:
Phishing is a social engineering crime in which an attacker
impersonates a trusted third party to gain access to private
data. Data mining (DM) techniques can be a very useful
methodology for identifying and detecting the phishing
websites reported to PhishTank. In this paper we present a
novel approach to overcome the challenge and complexity of
detecting and predicting phishing from offline data. We
propose an intelligent and effective model based on improved
classification algorithms (Improved C4.5, PRISM and PART)
and the associative classification algorithm MCAR. The
approach applies these different classification algorithms and
techniques to the phishing training dataset to determine the
legitimacy of its entries. We also compare their performance,
accuracy and the number of rules generated. The rules
generated by the associative classification model show
relationships between some important characteristics of the
PhishTank data. The experimental results show better
performance compared to other traditional classification
algorithms.
1 INTRODUCTION
Phishing is an online form of pretexting, a kind of deception
in which an attacker pretends to be someone else in order to
obtain sensitive information from the victim. Phishing is a
significant practical problem, with a reported accumulated loss
of $3.2 billion in 2007. Because of the immediate monetary
rewards from the sensitive information stolen (e.g. user
account names and passwords), financial institutions such as
PayPal, eBay, and banks have been the primary brands
affected by phishing attacks. The typical communication
medium for phishing attacks is email, forged to look as if it
comes from a legitimate organization. The email usually informs
the victims of a problem with their account and directs them
to take remedial action by entering personal information or
logging into their account at a fake website. Although
phishing has already spread to other online media, such as
instant messaging, where phishing was first reported, and
voice-over-IP (also known as vishing), the focus of this study
is email-based phishing because of its prevalence and data
availability.
Phishing: A social engineering-based spoofing attack
whereby a user is sent an email and spoofed into clicking on
a link to a bogus website, where personal information
including passwords and account numbers is often stolen.
These attacks can give the impression that the redirect is to
the legitimate site (rather than to a spoofed mock site),
although in fact this is not always possible.
Currently, human reviewers maintain some blacklists, such as
the one published by PhishTank. With PhishTank, the user
community manually verifies potential phishing pages
submitted by community members to keep the blacklist
mostly error-free. Unfortunately, this review process takes a
considerable amount of time, ranging from a median of over
ten hours in March 2009 to a median of over fifty hours in
June 2009, according to PhishTank's statistics. Omitting
verification to improve the timeliness of the data is not a good
option for PhishTank: without verification, the list would
contain many false positives arising from either innocent
confusion or malicious abuse. Existing anti-phishing
techniques, whether third-party certification based, password
based or URL based, are not robust enough for phishing
detection.
The phishing process can be described as three steps:
Planning, Attack, and Fraud. Each step is described below.
1. Planning. In this stage the attacker determines the victim
to attack, the information to be obtained from the victim, and
how to obtain this information. Social engineering techniques
are employed to gain information about the target victim.
Various media, for example phone, instant messaging clients,
email, and the Internet, can be used to gain this information.
2. Attack. This phase involves delivering the phishing
message and luring the victim into surrendering his/her
credentials. Email is a popular method used to deliver the
phishing message to the target. The layout of a typical
phishing email is shown in Figure 1. This mail is targeted at
gaining financial information from eBay clients.
3. Fraud. The final step of the attacker is fraud. The attacker
uses the information obtained in the attack phase to purchase
goods, steal money from the victim's account and commit
identity theft.
This three-step process does not stop after one attack. It is a
continuing process in which the attacker repeats the same
steps with another unsuspecting victim.
ISSN: 2231-5381 http://www.internationaljournalssrg.org
Association rule mining is a technique for finding
interesting rules in transactional databases [1]. It was
initially used to reveal associations in commercial data
drawn from a database of transactions, each representing the
set of items purchased by a customer. The association
analysis identifies items that are purchased together.
Association rule mining finds correlations between
attributes. In our case, we aim to find the correlations
between the attributes of the network data. As the correlation
differs between attributes of the network data, association
rules give the flexibility of determining different relations.
The results of the mining are expressed as rules.
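The two quantities that association rule mining relies on, support and confidence, can be illustrated with a minimal Python sketch. The toy records, the "attribute=value" item encoding and the function names below are assumptions made only for illustration; they are not taken from the dataset or implementation used in this paper.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    ante = frozenset(antecedent)
    both = ante | frozenset(consequent)
    s_ante = support(ante, transactions)
    return support(both, transactions) / s_ante if s_ante else 0.0


# Hypothetical toy records: each transaction is a set of "attribute=value" items.
transactions = [
    frozenset({"format=HTML", "sender=ABBY", "class=YES"}),
    frozenset({"format=HTML", "sender=ABBY", "class=YES"}),
    frozenset({"format=HTML", "sender=Amazon", "class=NO"}),
    frozenset({"format=XML", "sender=Amazon", "class=NO"}),
]

# Rule "sender=ABBY -> class=YES" on the toy data.
rule_support = support({"sender=ABBY", "class=YES"}, transactions)
rule_confidence = confidence({"sender=ABBY"}, {"class=YES"}, transactions)
```

A rule is normally kept only when both values reach user-chosen minimum thresholds, which is what allows the mining result to be expressed as a compact set of rules.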
Two-factor authentication, in which the user not only knows a
secret but also supplies a security token [6], is a server-side
solution. Phishing can still happen at sites that do not support
two-factor authentication. Sensitive information that is not
linked to a particular site, e.g., credit card information and
SSN, cannot be protected by this approach either [20].
The existing system has the following limitations:
• The existing algorithm uses threshold methods to classify
the phishing results.
• URLs are not normalized in the existing system.
• Email messages and URLs are recognized and treated as one
in the existing system.
2 BACKGROUND AND RELATED WORK
Gartner [2] estimated the cost at $1,244 per victim, an
increase over the $257 they cited in a 2004 report [1]. In 2007,
Moore and Clayton estimated the number of phishing victims
by examining web server logs. They estimated that 311,449
people fall for phishing scams annually, costing around 350
million dollars [2]. Several promising defensive approaches
to this problem have been reported. One approach is to stop
phishing at the email level [3], since the majority of current
phishing attacks use broadcast email (spam) to lure victims to
a phishing website. Another approach is to use security
toolbars. The phishing filter in IE7 [4] is a toolbar approach
whose features include blocking the user's activity on a
detected phishing site. A third approach is to visually
differentiate phishing sites from the spoofed legitimate
sites. Dynamic Security Skins [5] proposes using a
randomly generated visual hash to customize the browser
window or web form elements to indicate successfully
authenticated sites. A fourth approach is two-factor
authentication, which ensures that the user not only knows a
secret but also supplies a security token [6]. Many industrial
anti-phishing products use toolbars in Web browsers, though
some researchers have shown that security toolbars do not
effectively prevent phishing attacks. Yet another approach is
to employ certification, e.g., Microsoft spam privacy [7]. A
variant of web credentials is to use a database or list
published by a trusted party, in which known phishing web
sites are blacklisted. The weaknesses of this approach are its
poor scalability and its timeliness. The latest version of
Microsoft's Internet Explorer supports Extended Validation
(EV) certificates, coloring the URL bar green and displaying
the name of the company. However, it has been found that EV
certificates did not make users less likely to fall for phishing
attacks [8].
In addition, the existing system has the following drawbacks:
• Data collection is very difficult in this system.
• This system overfits when the data is large.
• Existing algorithms are not suitable for all types of
attributes.
3. PROPOSED FRAMEWORK
Two publicly available datasets were used to test our
implementation, including the "phishtank" dataset from
phishtank.com [3], which is considered one of the primary
phishing-report collections. The PhishTank database records
the URL of the suspected website that has been reported, the
time of that report, and sometimes further detail such as
screenshots of the website, and it is publicly available. We
used a series of short scripts to programmatically extract the
above features and stored them in an Excel sheet for quick
reference.
The practical part of this comparative study uses four
common DM classification algorithms (C4.5, PART, PRISM
and MCAR). Our choice of these methods is based on the
different strategies they apply to learning rules from data sets.
The PART algorithm was chosen because it combines both
approaches (the divide-and-conquer strategy of decision trees
and the separate-and-conquer strategy of rule learners) to
generate a set of rules. PRISM is a classification rule
algorithm that can only deal with nominal attributes and does
not do any pruning. Finally, the MCAR algorithm involves two
phases: rule generation and classifier building. In the first
phase, MCAR scans the training data set to discover frequent
single items, and then recursively combines the items
generated to produce items involving more attributes. MCAR
then generates, ranks and stores the rules. In the second
phase, the rules are used to build a classifier by considering
their effectiveness on the training data set.
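The second phase described for MCAR, ranking rules and keeping those that prove effective on the training data, can be sketched roughly as follows. This is only an illustrative sketch, not the MCAR implementation: the `Rule` class, the ranking order (confidence, then support, then fewer conditions) and the coverage test are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Instance = Dict[str, str]


@dataclass(frozen=True)
class Rule:
    conditions: Tuple[Tuple[str, str], ...]  # (attribute, value) pairs
    label: str
    confidence: float
    support: float

    def matches(self, row: Instance) -> bool:
        return all(row.get(attr) == value for attr, value in self.conditions)


def rank_rules(rules: List[Rule]) -> List[Rule]:
    # Assumed ordering: higher confidence, then higher support, then fewer conditions.
    return sorted(rules, key=lambda r: (-r.confidence, -r.support, len(r.conditions)))


def build_classifier(rules: List[Rule], rows: List[Instance],
                     labels: List[str]) -> List[Rule]:
    """Keep a rule only if it correctly classifies at least one uncovered training row."""
    uncovered = list(range(len(rows)))
    kept: List[Rule] = []
    for rule in rank_rules(rules):
        hit = [i for i in uncovered if rule.matches(rows[i])]
        if any(labels[i] == rule.label for i in hit):
            kept.append(rule)
            uncovered = [i for i in uncovered if i not in hit]
    return kept


def classify(kept: List[Rule], row: Instance,
             default: Optional[str] = None) -> Optional[str]:
    """Return the label of the first matching rule, or `default` if none matches."""
    for rule in kept:
        if rule.matches(row):
            return rule.label
    return default


# Hypothetical training rows and candidate rules.
rows = [
    {"sender": "ABBY", "format": "HTML"},
    {"sender": "ABBY", "format": "XML"},
    {"sender": "Amazon", "format": "HTML"},
]
labels = ["YES", "YES", "NO"]
rules = [
    Rule((("format", "HTML"),), "NO", 0.5, 1 / 3),
    Rule((("sender", "ABBY"),), "YES", 1.0, 2 / 3),
    Rule((("sender", "Amazon"),), "NO", 1.0, 1 / 3),
]
kept = build_classifier(rules, rows, labels)
prediction = classify(kept, {"sender": "Amazon", "format": "XML"})
```

In this toy run the low-confidence rule is dropped because the two higher-ranked rules already cover every training row, which is the kind of pruning by effectiveness that the second phase refers to.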
Sample record from the PhishTank dataset:
<url>http://200.32.240.252/ss2/?https://chaseonline.chase.com/public/login/Logon.aspx?LOB=COLLogon3002776sy266Y25d6t2t7jej66263DSHAhsd27ghG72GSH22s662gsg2</url>
<phish_id>1542621</phish_id>
<phish_detail_url>http://www.phishtank.com/phish_detail.php?phish_id=1542621</phish_detail_url>
<details>
<detail>
<ip_address>200.32.240.252</ip_address>
<cidr_block>200.32.240.0/21</cidr_block>
<announcing_network>10269</announcing_network>
<rir>lacnic</rir>
<detail_time>2012-08-27T05:29:49+00:00</detail_time>
</detail>
</details>
<submission>
<submission_time>2012-08-27T05:29:19+00:00</submission_time>
</submission>
<verification>
<verified>yes</verified>
<verification_time>2012-08-27T06:38:45+00:00</verification_time>
</verification>
<status>
<online>yes</online>
</status>

Multiple Minimum Supports Using Maximum Constraints association rule mining algorithm:
INPUT: A set of n transactions T, a set of p fields (features) to be examined, each with a predefined minimum support value $m_i$, i = 1 to p, and a minimum confidence value $\lambda$.
OUTPUT: A set of association rules under the criterion of the maximum values of minimum supports.
STEP 1: Calculate the count $c_k$ of each item $t_k$, k = 1 to p, as its number of occurrences in the transactions; derive its support value $s_{t_k}$ as:
$$s_{t_k} = \frac{c_k}{n}$$
STEP 2: Check whether the support $s_{t_k}$ of each item $t_k$ is larger than or equal to its predefined minimum support value $m_{t_k}$. If $t_k$ satisfies this condition, put it in the set of large 1-itemsets ($L_1$).
STEP 3: Set r = 1, where r is used to keep the current number of items in an itemset.
STEP 4: Generate the candidate set $C_{r+1}$ from $L_r$ in a way similar to that in the Apriori algorithm, except that the supports of all the large r-itemsets comprising each candidate (r+1)-itemset $I_k$ must be larger than or equal to the maximum (denoted as $m_{I_k}$) of the minimum supports of the items in these large r-itemsets.
STEP 5: Calculate the count $c_{I_k}$ of each candidate (r+1)-itemset $I_k$ in $C_{r+1}$ as its number of occurrences in the transactions; derive its support value $s_{I_k}$ as:
$$s_{I_k} = \frac{c_{I_k}}{n}$$
STEP 6: Check whether the support $s_{I_k}$ of each candidate (r+1)-itemset $I_k$ is larger than or equal to $m_{I_k}$ (obtained in STEP 4). If $I_k$ satisfies this condition, put it in the set of large (r+1)-itemsets ($L_{r+1}$).
STEP 7: If $L_{r+1}$ is null, do the next step; otherwise, set r = r+1 and repeat STEPs 4 to 7.
STEP 8: Construct the association rules for each large q-itemset $I_k$ with items $\{I_{k1}, I_{k2}, \dots, I_{kq}\}$ by the following substeps:
(a) Form all possible association rules as follows:
$$I_{k1} \wedge \dots \wedge I_{k,j-1} \wedge I_{k,j+1} \wedge \dots \wedge I_{kq} \rightarrow I_{kj}, \quad j = 1 \text{ to } q$$
(b) Calculate the confidence values of all association rules using the formula:
$$\frac{s_{I_k}}{s_{I_{k1} \wedge \dots \wedge I_{k,j-1} \wedge I_{k,j+1} \wedge \dots \wedge I_{kq}}}$$
STEP 9: Output the rules with confidence values larger than or equal to the predefined confidence value $\lambda$.

IMPROVED C45:
Attribute Selection:
Apply attribute selection to each attribute in the attribute list L to find the "best" splitting criterion. Gain measures how well a given attribute separates training examples into the target classes. The attribute with the highest information gain is selected. Given a collection S of c outcomes, the expected information needed to classify a tuple in D is given by the modified information, or entropy:
$$\mathrm{ModInfo}(D) = -\sum_{i=1}^{m} S_i \log S_i, \quad m \text{ different classes}$$
In the kdd99 dataset we have two class labels, i.e. normal and anomaly. Hence
$$\mathrm{ModInfo}(D) = -\sum_{i=1}^{2} S_i \log S_i = -S_1 \log S_1 - S_2 \log S_2$$
where $S_1$ indicates the set of samples that belong to target class 'anomaly' and $S_2$ indicates the set of samples that belong to target class 'normal'.
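The entropy and gain computation used for attribute selection can be sketched in a few lines of Python. The sketch uses the standard entropy with base-2 logarithms and treats each $S_i$ as the proportion of samples in class i; both choices, as well as the toy rows and attribute names, are assumptions for illustration and may differ from the modified measure used in the proposed classifier.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Standard entropy: -sum(p_i * log2(p_i)) over the class proportions p_i."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows: List[Dict[str, str]], labels: Sequence[str],
                     attribute: str) -> float:
    """Entropy of the whole set minus the weighted entropy after splitting on `attribute`."""
    partitions: Dict[str, List[str]] = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in partitions.values())
    return entropy(labels) - weighted


# Hypothetical two-class example with the class labels mentioned in the text.
labels = ["anomaly", "anomaly", "normal", "normal"]
rows = [
    {"protocol": "tcp", "flag": "a"},
    {"protocol": "tcp", "flag": "b"},
    {"protocol": "udp", "flag": "a"},
    {"protocol": "udp", "flag": "b"},
]
gains = {attr: information_gain(rows, labels, attr) for attr in ("protocol", "flag")}
best_attribute = max(gains, key=gains.get)
```

Here `protocol` separates the two classes perfectly and therefore has the highest gain, while `flag` leaves each partition evenly mixed and contributes no gain.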
The information, or entropy, of each attribute is calculated using
$$\mathrm{Info}_A(D) = \sum_{i=1}^{v} \frac{|D_i|}{|D|} \times \mathrm{ModInfo}(D_i)$$
The term $|D_i|/|D|$ acts as the weight of the ith partition. $\mathrm{Info}_A(D)$ is the expected information required to classify a tuple from D based on the partitioning by A. Information gain is defined as the difference between the original information requirement and the new requirement. That is,
$$\mathrm{Gain}(A) = \mathrm{ModInfo}(D) - \mathrm{Info}_A(D)$$
Finding Best Split:
In order to decide which attribute gives the best split, the correlation coefficient is used as a threshold:
$$r = \frac{\overline{XY} - \bar{X}\,\bar{Y}}{SD_x \cdot SD_y}$$
Let A = MaxGain{AttributeList}
If (r > 0 and A > r)
{
A is positively alerted and the node is selected.
}
Elseif (r < 0 and A > r)
{
A is negatively alerted and the node is discarded.
}
Elseif (r = 0 and A > r)
{
A is unalerted and next highest MaxGain is selected.
}
Else
A is discarded
Depending on the alert type severity, the root node and the child nodes are selected.
Recurse on the sublists obtained by splitting on the best attribute, and add those nodes as children of the node.
4 Experimental Results:
All experiments were performed on an Intel(R) Core(TM)2 CPU at 2.13 GHz with 2 GB RAM, running Microsoft Windows XP Professional (SP2).
Improved C45 SPAM tree
Number of Leaves : 3
Size of the tree : 5
Time taken to build model: 0.14 seconds
Time taken to test model on training data: 0.05 seconds
=== Error on training data ===
Correctly Classified Instances      63    100 %
Incorrectly Classified Instances     0      0 %
=== Confusion Matrix ===
a b c <-- classified as
21 0 0 | a = Czech Republic
0 20 0 | b = AZ, US
0 0 22 | c = US
=== Stratified cross-validation ===
Correctly Classified Instances      62    98.4127 %
Incorrectly Classified Instances     1     1.5873 %
=== Confusion Matrix ===
a b c <-- classified as
20 0 1 | a = Czech Republic
0 20 0 | b = AZ, US
0 0 22 | c = US
MultiClassClassifier
Classifier 1
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
Time taken to build model: 0.22 seconds
Time taken to test model on training data: 0.02 seconds
=== Error on training data ===
Correctly Classified Instances      21    87.5 %
Incorrectly Classified Instances     3    12.5 %
=== Confusion Matrix ===
a b <-- classified as
14 2 | a = YES
1 7 | b = NO
=== Stratified cross-validation ===
Correctly Classified Instances      14    58.3333 %
Incorrectly Classified Instances    10    41.6667 %
=== Confusion Matrix ===
a b <-- classified as
10 6 | a = YES
4 4 | b = NO
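The correctly and incorrectly classified counts reported for each classifier follow directly from its confusion matrix: the diagonal holds the correct predictions and everything else is an error. The short Python sketch below reproduces that arithmetic for the two MultiClassClassifier matrices and for the three-class cross-validation matrix; the function name is an illustrative choice and is not part of the WEKA output.

```python
from typing import List, Tuple


def accuracy_from_confusion(matrix: List[List[int]]) -> Tuple[int, int, float]:
    """Return (correct, incorrect, percent correct) for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    incorrect = total - correct
    percent = 100.0 * correct / total if total else 0.0
    return correct, incorrect, percent


# Confusion matrices copied from the results reported in this section.
training = [[14, 2], [1, 7]]
cross_validation = [[10, 6], [4, 4]]
three_class_cv = [[20, 0, 1], [0, 20, 0], [0, 0, 22]]

train_result = accuracy_from_confusion(training)
cv_result = accuracy_from_confusion(cross_validation)
three_class_result = accuracy_from_confusion(three_class_cv)
```

The gap between the training figure and the cross-validation figure for the same classifier is the usual indication of how much the training accuracy overstates performance on unseen data.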
PART decision list
URL of Web Content: = http://209.61.245.164/board/gp/signin=UTF8&email=&disableCorpSignUp=&path=gp=Fyourstorehome& redirectProtocol=&mode=& useRedirectOnSuccess=1& query=signIn.htm: NO (8.0/3.0)
Apparent Sender = ABBY: YES (12.0/1.0)
Location: = Czech Republic: NO (2.0)
: YES (2.0)
Number of Rules : 4
Time taken to build model: 0.13 seconds
Time taken to test model on training data: 0.02 seconds
=== Error on training data ===
Correctly Classified Instances      20    83.3333 %
Incorrectly Classified Instances     4    16.6667 %
=== Confusion Matrix ===
a b <-- classified as
13 3 | a = YES
1 7 | b = NO
=== Stratified cross-validation ===
Correctly Classified Instances      15    62.5 %
Incorrectly Classified Instances     9    37.5 %
Kappa statistic                      0.069
=== Confusion Matrix ===
a b <-- classified as
13 3 | a = YES
6 2 | b = NO
Prism rules
If URL of Web Content: = http://ordersuk.szm.sk/sing. varzea-loginamazon.htm then YES
If Detailed server information = 56.167.205.59 detailed server information then YES
If Return Address: = account-update1@ amazon.co.uk then YES
If Return Address: = account-update1@ ABBY.co.uk then YES
If URL of Web Content: = http://72.167.25.59/gp/index.php then YES
If Apparent Sender = ABBY
and Email Format = XML then YES
If Return Address: = ABBY.co.uk < auto-confirm@ ABBY.co.uk
and Apparent Sender = Amazon
and Email Format = HTML
and URL of Web Content: = http://209.61.245.164/board/gp/signin=UTF8&email=&disableCorpSignUp=&path=gp=Fyourstorehome& redirectProtocol=&mode=& useRedirectOnSuccess=1& query=signIn.htm
and Location: = US
and Detailed server information = 209.61.245.164 detailed server information then YES
If Return Address: = amazon.co.uk < auto-confirm@ amazon.co.uk
and Email Format = HTML
and Apparent Sender = Amazon
and URL of Web Content: = http://209.61.245.164/board/gp/signin=UTF8&email=&disableCorpSignUp=&path=gp=Fyourstorehome& redirectProtocol=&mode=& useRedirectOnSuccess=1& query=signIn.htm
and Location: = US
and Detailed server information = 209.61.245.164 detailed server information then YES
If Return Address: = amazon.co.uk < auto-confirm@ amazon.co.uk
and Email Format = XML then NO
If Apparent Sender = Amazon
and Return Address: = account-update@ ABBY.co.uk then NO
If URL of Web Content: = http://209.61.245.164/board/gp/signin=UTF8&email=&disableCorpSignUp=&path=gp=Fyourstorehome& redirectProtocol=&mode=& useRedirectOnSuccess=1& query=signIn.htm
and Email Format = HTML
and Apparent Sender = ABBY then NO
If Return Address: = amazon.co.uk < auto-confirm@ amazon.co.uk
and Apparent Sender = Amazon
and Email Format = HTML
and URL of Web Content: = http://209.61.245.164/board/gp/signin=UTF8&email=&disableCorpSignUp=&path=gp=Fyourstorehome& redirectProtocol=&mode=& useRedirectOnSuccess=1& query=signIn.htm
and Location: = US
and Detailed server information = 209.61.245.164 detailed server information then NO
If Return Address: = ABBY.co.uk < auto-confirm@ ABBY.co.uk
and Apparent Sender = Amazon
and Email Format = HTML
and URL of Web Content: = http://209.61.245.164/board/gp/signin=UTF8&email=&disableCorpSignUp=&path=gp=Fyourstorehome& redirectProtocol=&mode=& useRedirectOnSuccess=1& query=signIn.htm
and Location: = US
and Detailed server information = 209.61.245.164 detailed server information then NO
If URL of Web Content: = http://72.167.205.59/gp/index.php
and Apparent Sender = ABBY
and Email Format = HTML then NO
Time taken to build model: 0.04 seconds
Time taken to test model on training data: 0.02 seconds
=== Error on training data ===
Correctly Classified Instances      22    91.6667 %
Incorrectly Classified Instances     2     8.3333 %
=== Confusion Matrix ===
a b <-- classified as
16 0 | a = YES
2 6 | b = NO
=== Stratified cross-validation ===
Correctly Classified Instances      14    58.3333 %
Incorrectly Classified Instances     6    25 %
=== Confusion Matrix ===
a b <-- classified as
10 3 | a = YES
3 4 | b = NO
Figure: Proposed algorithm performance evaluation (percentage of correctly and incorrectly classified instances)
Algorithm      Correctly (%)   Incorrectly (%)
PRISM          91.67           8.333
PART           83.33           16.67
MCAR           87.5            12.5
Improved C45   100             0
5. CONCLUSION AND FUTURE WORK
This work identifies several new and generic features for
identifying phishing URLs. The proposed Improved C4.5
classifier and MCAR achieve very high accuracy. One of the
major contributions of this work is a large-scale measurement
study conducted on PhishTank web URLs. The experimental
results show that phishing and non-phishing instances can be
identified on the basis of different types of features. Data
mining algorithms require an offline training phase, but the
testing phase requires much less time, and future work could
investigate how well the approach can be adapted to online
phishing web classification.
REFERENCES:
[1] R. Dhamija and J.D. Tygar, "The Battle against Phishing: Dynamic Security Skins," Proc. Symp. Usable Privacy and Security, 2005.
[2] G. Sunil Kumar, "Real Time and Offline Network Intrusion Detection using Improved Decision Tree Algorithm," International Journal of Computer Applications (0975-8887), Volume 48, No. 25, June 2012.
[3] FDIC, "Putting an End to Account-Hijacking Identity Theft," http://www.fdic.gov/consumers/consumer/idtheftstudy/identity_theft.pdf, 2004.
[4] Maher Aburrous, "Associative Classification Techniques for predicting e-Banking Phishing Websites."
[5] J. R. Quinlan, "Improved use of continuous attributes in C4.5," Journal of Artificial Intelligence Research, 4:77-90, 1996.
[6] Bing Liu, Wynne Hsu, Yiming Ma, "Integrating Classification and Association Rule Mining," Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98, Plenary Presentation), New York, USA, 1998.
[7] T. Fadi, C. Peter and Y. Peng, "MCAR: Multi-class Classification based on Association Rule," IEEE International Conference on Computer Systems and Applications, 2005, pp. 127-133.
[8] WEKA, University of Waikato, New Zealand, "Weka - Data Mining with Open Source Machine Learning Software in Java," 2006.
[9] A. Y. Fu, L. Wenyin and X. Deng, "Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover's Distance (EMD)," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 4, 2006.
[10] B. Adida, S. Hohenberger and R. Rivest, "Lightweight Encryption for Email," USENIX Steps to Reducing Unwanted Traffic on the Internet (SRUTI), 2005.
[11] T. Sharif, "Phishing Filter in IE7," http://blogs.msdn.com/ie/archive/2005/09/09/463204.aspx, 2006.
[12] J. Cendrowska, "PRISM: An algorithm for inducing modular rules," International Journal of Man-Machine Studies (1987), Vol. 27, No. 4, pp. 349-370.