Phishing Web Pages Detection

Phishing Webpage Detection
Jau-Yuan Chen
COMS E6125 WHIM
March 24, 2009
What is “Phishing”?
• Source: "Phishing Activity Trends Report," APWG (Anti-Phishing Working Group), December 2008
• Definition:
– Phishing is a criminal mechanism employing both social engineering and technical subterfuge to steal consumers’ personal identity data and financial account credentials.
– Social-engineering schemes use spoofed e-mails purporting to be from legitimate businesses and agencies to lead consumers to counterfeit websites designed to trick recipients into divulging financial data such as usernames and passwords.
– Technical-subterfuge schemes plant crimeware onto PCs to steal credentials directly, often using systems to intercept consumers’ online account user names and passwords, and to corrupt local navigational infrastructures to misdirect consumers to counterfeit websites (or authentic websites through phisher-controlled proxies used to monitor and intercept consumers’ keystrokes).
March 23, 2016
Severity of the “Phishing” Problem
• The number of crimeware-spreading sites infecting PCs with password-stealing crimeware reached an all-time high of 31,173 in December 2008.
• Unique phishing reports submitted to APWG recorded a yearly high of 34,758 in December 2008.
• In 2007, according to a survey by Gartner, Inc.:
– more than $3.2 billion was lost to phishing attacks in the US
– 3.6 million adults lost money in phishing attacks
WHY PHISHING PAGE DETECTION?
It’s difficult to distinguish
these pages!
Most Targeted Industry
Current Anti-phishing Solutions
• text-based page analysis
– URL analysis
– HTML parsing
– keyword extraction
• However, phishers can easily evade detection by using non-HTML components, such as
– images,
– Flash,
– ActiveX, etc.
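The weakness of text-based analysis can be illustrated with a minimal sketch. It assumes a hypothetical keyword list and two made-up pages; it is not any particular product's detector. A keyword extractor that reads HTML text nodes finds nothing once the same content is delivered as an image.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; a real detector would use a much larger vocabulary.
SUSPICIOUS_KEYWORDS = {"password", "login", "verify", "account"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes, which is all a text-based analyzer can see."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def keyword_hits(html):
    """Return the suspicious keywords found in the page's text nodes."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return SUSPICIOUS_KEYWORDS.intersection(parser.words)


# Two made-up pages: one carries its content as text, the other as an image.
text_page = "<html><body><p>Please verify your account password</p></body></html>"
image_page = '<html><body><img src="login_form.png"></body></html>'

hits_text = keyword_hits(text_page)    # keywords are visible to the parser
hits_image = keyword_hits(image_page)  # nothing to extract from an image tag
```

The image-only page yields no keyword hits even though a user would see the same login form, which is the gap an image-based scheme is meant to close.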
Image-based Anti-phishing Scheme
focus on "what you see",
not "how the page is composed"!
J.-Y. Chen, and K.-T. Chen, “A Robust Local Feature-based Scheme for Phishing
Page Detection and Discrimination,” Web 2.0 Trust 2008.
K.-T. Chen, J.-Y. Chen, C.-R. Huang, and C.-S. Chen, “Fighting Phishing with
Discriminative Keypoint Features of Webpages,” IEEE Internet Computing, to
appear.
Page Matching
[Pipeline diagram: Image-based Page Matching → Page Scoring → Page Classification]
Page Scoring
[Pipeline diagram: Image-based Page Matching → Page Scoring → Page Classification]
[Figure labels: a successful match; effective grids]
Page Classification
[Pipeline diagram: Image-based Page Matching → Page Scoring → Page Classification]
• naïve Bayesian classifier with 10-fold cross-validation
• training data
– a pre-stored phishing page set & a legitimate page set
– phishing page set (positive data set)
• comparisons between phishing pages and their target pages
– legitimate page set (negative data set)
• comparisons between legitimate pages of different sites
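The classification step can be sketched as follows. This is a minimal illustration, assuming a Gaussian naïve Bayes model over a single hypothetical similarity score and synthetic data; the slides do not state which naïve Bayes variant or which features were actually used.

```python
import math
import random


def fit_gaussian_nb(samples, labels):
    """Estimate per-class prior, mean, and variance for each feature."""
    model = {}
    for cls in set(labels):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[cls] = (len(rows) / len(samples), means, variances)
    return model


def predict(model, sample):
    """Pick the class with the highest log posterior."""
    def log_posterior(cls):
        prior, means, variances = model[cls]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(sample, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=log_posterior)


def cross_validate(samples, labels, folds=10, seed=0):
    """Return the mean accuracy over `folds` disjoint test folds."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    accuracies = []
    for k in range(folds):
        test_idx = set(order[k::folds])
        train = [i for i in order if i not in test_idx]
        model = fit_gaussian_nb([samples[i] for i in train],
                                [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / folds


# Synthetic similarity scores (an assumption, not the experimental data):
# phishing-vs-target comparisons score high, legitimate-vs-legitimate score low.
rng = random.Random(42)
positive = [[rng.gauss(0.8, 0.1)] for _ in range(100)]
negative = [[rng.gauss(0.2, 0.1)] for _ in range(400)]
samples = positive + negative
labels = [1] * len(positive) + [0] * len(negative)
mean_accuracy = cross_validate(samples, labels)
```

Each fold is held out once for testing while the other nine train the model, so every comparison is scored by a classifier that never saw it.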
PERFORMANCE EVALUATION
Data description
• phishing pages: 2,058 pages on 74 sites
– source: http://www.phishtank.com, http://www.antiphishing.org
– the top 5 phishing target sites account for more than half of our records

Domain               Number of Records
eBay                 701
PayPal               632
Marshall & Ilsley    138
Charter One          116
Bank of America      51

• potential target pages: 300 vulnerable pages
– source: http://www.ciphertrust.com/resources/statistics/
• pre-stored data set
– positive: 2,058 comparisons
– negative: 44,000 comparisons
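A short arithmetic check of the figures on this slide, using only the numbers listed above (the variable names are illustrative):

```python
# Record counts for the top 5 phishing target sites, as listed on the slide.
top5_records = {
    "eBay": 701,
    "PayPal": 632,
    "Marshall & Ilsley": 138,
    "Charter One": 116,
    "Bank of America": 51,
}
total_phishing_pages = 2058
positive_comparisons = 2058
negative_comparisons = 44000

top5_total = sum(top5_records.values())
top5_share = top5_total / total_phishing_pages
positive_fraction = positive_comparisons / (
    positive_comparisons + negative_comparisons
)

print(f"top-5 records: {top5_total} ({top5_share:.1%} of all phishing pages)")
print(f"positive share of pre-stored comparisons: {positive_fraction:.1%}")
```

The second figure shows how strongly the pre-stored data set is skewed toward negative comparisons, which is worth keeping in mind when reading accuracy numbers.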
Earth Mover’s Distance (EMD) based Scheme
• Fu et al., IEEE Trans. on Dependable & Secure Computing, 2006
• the first image-based phishing detection approach
• EMD is used to evaluate the distance between two signatures
• Signature (S)
– the frequency and the centroid of each color used
• Weight (p, q)
– a linear combination of the Euclidean color distance and the distance between color centroids
• Visual similarity degree (VSD)
– VSD = 1 – EMD^α
• pros: simple and fast
• cons: only suitable for basic phishing cases
– it tends to fail if phishing pages and the official ones are only partially similar
– however, phishing pages are usually partially different from their targets!
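The pieces of the EMD-based scheme can be sketched as below. This is a rough illustration, not Fu et al.'s implementation: the function names are hypothetical, the distances are left unnormalized (an assumption), and the EMD itself, which requires solving a transportation problem, is taken as an already computed input in [0, 1]. The default values follow the parameter settings reported in this talk.

```python
import math
from collections import defaultdict


def degrade(color, cdf=32):
    """Quantize each RGB channel into buckets of width `cdf`."""
    return tuple(channel // cdf for channel in color)


def signature(pixels, width, max_colors=20, cdf=32):
    """Return [(color, frequency, centroid)] for the most frequent degraded colors.

    `pixels` is a row-major list of (r, g, b) tuples.
    """
    stats = defaultdict(lambda: [0, 0, 0])  # color -> [count, sum_x, sum_y]
    for index, color in enumerate(pixels):
        entry = stats[degrade(color, cdf)]
        entry[0] += 1
        entry[1] += index % width
        entry[2] += index // width
    top = sorted(stats.items(), key=lambda item: -item[1][0])[:max_colors]
    return [(c, n / len(pixels), (sx / n, sy / n)) for c, (n, sx, sy) in top]


def ground_distance(a, b, p=0.5, q=0.5):
    """Linear combination of color distance and centroid distance (unnormalized)."""
    (color_a, _, centroid_a), (color_b, _, centroid_b) = a, b
    return p * math.dist(color_a, color_b) + q * math.dist(centroid_a, centroid_b)


def visual_similarity(emd, alpha=0.5):
    """VSD = 1 - EMD^alpha, for an EMD value already normalized to [0, 1]."""
    return 1.0 - emd ** alpha


# A made-up 2 x 2 image: three reddish pixels and one blue pixel.
pixels = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (250, 5, 5)]
sig = signature(pixels, width=2)
```

Because the signature summarizes colors over the whole image, a page that differs from its target in only part of the layout can shift the frequencies and centroids enough to lower the VSD, which matches the weakness noted in the cons above.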
Parameter Settings
• CCH settings
– levels to describe salient points (L) = 4
– Euclidean distance between two salient points (Dist) = 7 pixels
– input image size: original webpage resolution (mostly 800 × 600)
– k-means parameter (k) = 4
– naïve Bayesian classifier
• EMD settings
– we follow the suggestion in Fu et al.'s previous work
– input image size: 100 × 100 (Lanczos3 resampling algorithm)
– color degrading factor (CDF): 32
– amplifier for the EMD value (α): 0.5
– the number of colors used for the signature (|Ss|): 20
– the weight for the color distance (p): 0.5
– the weight for the color centroid distance (q): 0.5
– a naïve Bayesian classifier is used instead of a per-page threshold
• Top 5 Phishing Target Sites
– AUC
• CCH: 0.998
• EMD: 0.956
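For readers unfamiliar with the metric, AUC (area under the ROC curve) can be read as the probability that a randomly chosen positive comparison receives a higher score than a randomly chosen negative one. A minimal sketch with made-up scores (not the experimental data):

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores, not data from the experiments.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
area = auc(scores, labels)
```

An AUC of 1.0 means every positive is ranked above every negative, and 0.5 corresponds to random ranking.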
• Impact of Image Size on Computation Time
Conclusions
• We proposed an image-based phishing detection technique that uses local features.
• Our experimental results show
– a phishing recognition rate above 96%, and
– an average of less than 0.30 seconds per phishing identification.
• Our experiments show that local features are more suitable than global information for phishing page detection.
THANK YOU!