PhishNet: Predictive Blacklisting to Detect Phishing Attacks
Reporter: Gia-Nan Gao
Advisor: Chin-Laung Lei
2010/4/26
1
Reference

Pawan Prakash, Manish Kumar, Ramana Rao Kompella, and Minaxi Gupta, “PhishNet: Predictive Blacklisting to Detect Phishing Attacks,” in IEEE INFOCOM 2010.
2
Outline
Introduction
Two Major Components of PhishNet
◦ URL prediction component
◦ Approximate URL matching component
Evaluation
Conclusion

3
Introduction

Phishing attacks
◦ Set up fake web sites mimicking real
businesses in order to lure innocent users
into revealing sensitive information

Blacklisting
◦ Match a given URL with a list of URLs
belonging to a blacklist

Problem of blacklisting
◦ A malicious URL cannot be blacklisted until it has
become sufficiently prevalent in the wild
4
Two Major Components of
PhishNet

URL prediction component
◦ Generate new URLs (child) from known phishing
URLs (parent) by employing various heuristics
◦ Test whether the new URLs generated are indeed
malicious

Approximate URL matching component
◦ Perform an approximate match of a new URL with
the existing blacklist
5
Component 1: Heuristics for
Generating New URLs

Typical structure of a blacklisted URL
◦ http://domain.TLD/directory/filename?query string

H1: Replacing TLDs
H2: IP address equivalence
H3: Directory structure similarity
H4: Query string substitution
H5: Brand name equivalence

6
Heuristics for Generating New
URLs

H1: Replacing TLDs
◦ 3,210 effective top-level domains (TLDs)
◦ Replace the effective TLD of the parent URL
with each of the 3,209 other effective TLDs

H2: IP address equivalence
◦ Phishing URLs that share an IP address are
grouped together into clusters
◦ Create new URLs by considering all
combinations of hostnames and pathnames within a cluster
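
H1 and H2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: SAMPLE_TLDS is a tiny stand-in for the 3,210 effective TLDs, and the function names and example hostnames are hypothetical.

```python
from itertools import product

# Hypothetical stand-in for the full list of 3,210 effective TLDs.
SAMPLE_TLDS = ["com", "net", "org", "co.uk"]


def replace_tld(host, tlds=SAMPLE_TLDS):
    """H1: return child hostnames with the parent's effective TLD swapped."""
    # Pick the longest known TLD that the hostname ends with.
    current = max(
        (t for t in tlds if host.endswith("." + t)), key=len, default=None
    )
    if current is None:
        return []
    stem = host[: -len(current)]  # keeps the trailing dot, e.g. "www.abc."
    return [stem + t for t in tlds if t != current]


def ip_cluster_children(cluster):
    """H2: cluster holds (hostname, pathname) pairs that share one IP.

    Returns every hostname/pathname combination not already in the cluster.
    """
    hosts = sorted({h for h, _ in cluster})
    paths = sorted({p for _, p in cluster})
    known = set(cluster)
    return [(h, p) for h, p in product(hosts, paths) if (h, p) not in known]


print(replace_tld("www.abc.com"))
print(ip_cluster_children([("a.example", "/login.htm"),
                           ("b.example", "/pay/index.php")]))
```

Every generated child is only a candidate; it still has to pass the verification step before it is treated as malicious.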
7
Heuristics for Generating New
URLs (cont’d)

H3: Directory structure similarity
◦ URLs with similar directory structure are grouped
together
◦ Build new URLs by exchanging the filenames among
URLs belonging to the same group
◦ Parent
 ▪ www.abc.com/online/signin/paypal.htm
 ▪ www.xyz.com/online/signin/ebay.htm
◦ Child
 ▪ www.abc.com/online/signin/ebay.htm
 ▪ www.xyz.com/online/signin/paypal.htm
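
The filename exchange of H3 can be sketched as follows. This is a simplified illustration that assumes "similar" directory structure means an identical directory path; the function name directory_children is hypothetical.

```python
from collections import defaultdict


def directory_children(parents):
    """H3: swap filenames among URLs that share the same directory path."""
    groups = defaultdict(list)  # directory path -> [(host, filename), ...]
    for url in parents:
        host, _, path = url.partition("/")
        directory, _, filename = path.rpartition("/")
        groups[directory].append((host, filename))

    known = set(parents)
    children = []
    for directory, members in groups.items():
        hosts = sorted({h for h, _ in members})
        files = sorted({f for _, f in members})
        for h in hosts:
            for f in files:
                # Skip empty parts so a URL without a directory stays valid.
                child = "/".join(p for p in (h, directory, f) if p)
                if child not in known:
                    children.append(child)
    return children


parents = [
    "www.abc.com/online/signin/paypal.htm",
    "www.xyz.com/online/signin/ebay.htm",
]
print(directory_children(parents))
```

With the two parent URLs from the slide, the sketch yields exactly the two child URLs listed above.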
8
Heuristics for Generating New
URLs (cont’d)

H4: Query string substitution
◦ Build new URLs by exchanging the query
strings among URLs
◦ Parent
 ▪ www.abc.com/online/signin/ebay?XYZ
 ▪ www.xyz.com/online/signin/paypal?ABC
◦ Child
 ▪ www.abc.com/online/signin/ebay?ABC
 ▪ www.xyz.com/online/signin/paypal?XYZ
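
A minimal sketch of the query string exchange, assuming every parent URL is combined with every other parent's query string; the function name query_children is hypothetical.

```python
def query_children(parents):
    """H4: combine each URL prefix with the query strings of the others."""
    bases, queries = [], []
    for url in parents:
        base, sep, query = url.partition("?")
        if sep:  # only URLs that actually carry a query string take part
            bases.append(base)
            queries.append(query)

    known = set(parents)
    candidates = (f"{b}?{q}" for b in bases for q in queries)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return [c for c in dict.fromkeys(candidates) if c not in known]


parents = [
    "www.abc.com/online/signin/ebay?XYZ",
    "www.xyz.com/online/signin/paypal?ABC",
]
print(query_children(parents))
```

For the slide's two parents this produces the two child URLs shown above.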
9
Heuristics for Generating New
URLs (cont’d)

H5: Brand name equivalence
◦ Build new URLs by substituting brand names
occurring in phishing URLs with other brand
names
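
Brand name substitution might look like the sketch below. The BRANDS list is a hypothetical three-entry example, and plain lowercase substring matching is a simplifying assumption.

```python
# Hypothetical brand list; a real deployment would use a much larger one.
BRANDS = ["paypal", "ebay", "chase"]


def brand_children(url, brands=BRANDS):
    """H5: replace each brand name found in the URL with every other brand."""
    children = []
    for found in brands:
        if found not in url:
            continue
        for other in brands:
            if other != found:
                children.append(url.replace(found, other))
    return children


print(brand_children("www.abc.com/online/signin/paypal.htm"))
```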
10
Component 1: Verification


Conduct a DNS lookup to filter out sites that cannot
be resolved

For each of the resolved URLs
◦ Try to establish a connection to the corresponding server

For each successful connection
◦ Issue an HTTP GET request to obtain content from the server

If the HTTP response from the server has status code
200/202 (successful request)
◦ Compute the content similarity between the parent and the child
pages

If the child URL’s content closely resembles that of the
parent URL (say, above 90% similarity)
◦ Conclude that the child URL is a bad site
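
The verification steps above can be sketched as a single function. This is an illustration under stated assumptions: difflib.SequenceMatcher is a stand-in for whatever similarity measure the authors used, urllib's urlopen merges the connection and GET steps, and the names resolves, fetch, and is_phishing_child are hypothetical.

```python
import socket
import urllib.request
from difflib import SequenceMatcher


def resolves(host):
    """Step 1: DNS lookup; False if the hostname cannot be resolved."""
    try:
        socket.gethostbyname(host)
        return True
    except OSError:
        return False


def fetch(url, timeout=5):
    """Steps 2-3: connect and issue a GET; None on any network/HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None


def is_phishing_child(child_url, host, parent_body, threshold=0.9,
                      resolve_fn=resolves, fetch_fn=fetch):
    """Return True if the child URL passes every verification step."""
    if not resolve_fn(host):
        return False
    result = fetch_fn(child_url)
    if result is None:
        return False
    status, body = result
    if status not in (200, 202):
        return False
    # Assumed similarity measure: ratio of matching characters in [0, 1].
    return SequenceMatcher(None, parent_body, body).ratio() > threshold
```

The resolve_fn and fetch_fn parameters exist only so the logic can be exercised without network access, for example by passing `fetch_fn=lambda url: (200, parent_body)`.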
11
Component 2: Approximate
Matching

Determine whether a given URL is a phishing
site or not
12
M1: Matching IP Address



Perform a direct match of the URL’s IP address against
the IP addresses of the blacklist entries

Assign a normalized score based on the number of
blacklist entries that map to a given IP address

If IP address IP_i is common to n_i URLs, its score is n_i
normalized using min{n_i} and max{n_i}
◦ min{n_i} (max{n_i}): the minimum (maximum) number of
phishing URLs hosted at any blacklisted IP address
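
The slide's formula did not survive extraction, so the sketch below assumes ordinary min-max normalization, (n_i - min) / (max - min), which is consistent with the quantities defined above but is not confirmed by this text. The function name ip_scores and the example addresses are hypothetical.

```python
from collections import Counter


def ip_scores(blacklist_ips):
    """M1: map each blacklisted IP to a min-max normalized score in [0, 1].

    blacklist_ips has one entry per blacklisted URL (its resolved IP).
    """
    counts = Counter(blacklist_ips)  # n_i for each IP_i
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:  # every IP hosts the same number of URLs
        return {ip: 1.0 for ip in counts}
    return {ip: (n - lo) / (hi - lo) for ip, n in counts.items()}


scores = ip_scores(["192.0.2.1"] * 5 + ["192.0.2.2"] * 3 + ["192.0.2.3"])
print(scores)
print(scores.get("198.51.100.7", 0.0))  # an IP not in the blacklist scores 0
```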
13
M2: Matching Hostname
Perform a hostname match against those in
the blacklist

Domains of phishing URLs
◦ Specifically registered for hosting phishing
sites
◦ Hosted on free/paid-for web-hosting services
(WHS)

Identify whether an incoming URL
is hosted on a WHS or not
◦ Matching WHSes
◦ Matching non-WHSes
14
M2: Matching Hostname (cont’d)
15
M3: Matching Directory Structure

Perform a directory structure match against
those in the blacklist

Design rationale: based on
◦ H3 (directory structure similarity)
◦ H4 (query string substitution)

n_i: the number of URLs corresponding to
a directory structure
16
M4: Matching Brand Names

Check for the existence of brand names in
the pathname and query string of URLs

n_i: the number of occurrences of the
brand name

Compute a final cumulative score
◦ Assign different weights to different modules
17
Evaluation: Component 1

Collect 6,000 URLs from PhishTank (2009/7/2
~ 2009/7/25)
18
Evaluation: Component 2
Metrics: how many benign sites are flagged as malicious (false
positives) and how many malicious sites are not flagged (false negatives)

Data source
◦ Phishing URLs
 ▪ PhishTank (about 18,000 URLs)
 ▪ SpamScatter (14,000 URLs)
◦ Benign URLs
 ▪ DMOZ (100,000 benign URLs)
 ▪ Yahoo Random URL generator (YRUG) (20,000 benign URLs)
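
The two error metrics can be computed as below. This is a generic sketch with made-up toy labels, not the paper's data or results; the function name error_rates is hypothetical.

```python
def error_rates(labels, flags):
    """Return (false positive rate, false negative rate).

    labels[i] is True if URL i is really phishing;
    flags[i] is True if the matcher flagged URL i as phishing.
    """
    fp = sum(1 for y, f in zip(labels, flags) if not y and f)
    fn = sum(1 for y, f in zip(labels, flags) if y and not f)
    benign = labels.count(False)
    phishing = labels.count(True)
    fp_rate = fp / benign if benign else 0.0
    fn_rate = fn / phishing if phishing else 0.0
    return fp_rate, fn_rate


# Toy data: 2 phishing URLs (one missed), 4 benign URLs (one wrongly flagged).
labels = [True, True, False, False, False, False]
flags = [True, False, True, False, False, False]
print(error_rates(labels, flags))
```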
19
Evaluation: Component 2 (cont’d)

Training phase
◦ Create various data structures using the
phishing URLs

Testing phase
◦ Each input URL is flagged as either a phishing or a
benign site

Weight of individual modules
◦ W(M1, M2, M3, M4) = (1.0, 1.0, 1.5, 1.5)
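
A sketch of how the module weights could be combined, assuming the cumulative score is a plain weighted sum of the per-module scores. The weights come from the slide; the decision threshold of 1.0 and the function names are hypothetical.

```python
# Module weights from the slide: W(M1, M2, M3, M4) = (1.0, 1.0, 1.5, 1.5)
WEIGHTS = {"M1": 1.0, "M2": 1.0, "M3": 1.5, "M4": 1.5}


def cumulative_score(module_scores, weights=WEIGHTS):
    """Weighted sum of per-module scores; a missing module counts as 0."""
    return sum(w * module_scores.get(m, 0.0) for m, w in weights.items())


def is_flagged(module_scores, threshold=1.0):
    """Flag a URL when its cumulative score reaches the assumed threshold."""
    return cumulative_score(module_scores) >= threshold


example = {"M1": 1.0, "M2": 0.0, "M3": 0.5, "M4": 0.0}
print(cumulative_score(example), is_flagged(example))
```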
20
Evaluation: Component 2 (cont’d)
21
Conclusion

Address major problems associated with
blacklists

Two major components of PhishNet
◦ URL prediction component
◦ Approximate URL matching component

Flag new URLs effectively
22