Phishing Webpage Detection
Jau-Yuan Chen
COMS E6125 WHIM
March 24, 2009

What is “Phishing”?
• Source: "Phishing Activity Trends Report," APWG, December 2008
• APWG: Anti-Phishing Working Group
• (Definition)
  – Phishing is a criminal mechanism employing both social engineering and technical subterfuge to steal consumers’ personal identity data and financial account credentials.
  – Social-engineering schemes use spoofed e-mails purporting to be from legitimate businesses and agencies to lead consumers to counterfeit websites designed to trick recipients into divulging financial data such as usernames and passwords.
  – Technical-subterfuge schemes plant crimeware onto PCs to steal credentials directly, often using systems to intercept consumers’ online account user names and passwords, and to corrupt local navigational infrastructures to misdirect consumers to counterfeit websites (or authentic websites through phisher-controlled proxies used to monitor and intercept consumers’ keystrokes).

March 23, 2016

Severity of the “Phishing” Problem
• The number of crimeware-spreading sites infecting PCs with password-stealing crimeware reached an all-time high of 31,173 in December 2008.
• Unique phishing reports submitted to APWG reached a yearly high of 34,758 in December 2008.
• In 2007 (survey by Gartner, Inc.)
  – more than $3.2 billion was lost to phishing attacks in the US
  – 3.6 million adults lost money in phishing attacks

WHY PHISHING PAGE DETECTION?

It’s difficult to distinguish these pages!
[figure]

Most Targeted Industry
[figure]

Current Anti-phishing Solutions
• text-based page analysis
  – URL analysis
  – HTML parsing
  – keyword extraction
• however, phishers can easily avoid detection by using non-HTML components, such as
  – images,
  – Flash,
  – ActiveX, etc.

Image-based Anti-phishing Scheme
focus on "what you see", not "how the page is composed"!
• J.-Y. Chen and K.-T. Chen, “A Robust Local Feature-based Scheme for Phishing Page Detection and Discrimination,” Web 2.0 Trust 2008.
• K.-T. Chen, J.-Y. Chen, C.-R. Huang, and C.-S. Chen, “Fighting Phishing with Discriminative Keypoint Features of Webpages,” IEEE Internet Computing, to appear.

Page Matching
Stages: Image-based Page Matching → Page Scoring → Page Classification
[figure]

Page Scoring
[figure; labels: "a successful match", "effective grids"]

Page Classification
• naïve Bayesian classifier with 10-fold cross-validation
• training data
  – a pre-stored phishing page set and a legitimate page set
  – phishing page set (positive data set)
    • comparisons between phishing pages and their target pages
  – legitimate page set (negative data set)
    • comparisons between legitimate pages of different sites

PERFORMANCE EVALUATION

Data description
• phishing pages: 2,058 pages on 74 sites
  – source: http://www.phishtank.com, http://www.antiphishing.org
  – the top 5 phishing target sites account for more than half of our records

    Domain              Number of Records
    eBay                701
    PayPal              632
    Marshall & Ilsley   138
    Charter One         116
    Bank of America     51

• potential target pages: 300 vulnerable pages
  – source: http://www.ciphertrust.com/resources/statistics/
• pre-stored data set
  – positive: 2,058 comparisons
  – negative: 44,000 comparisons

Earth Mover’s Distance (EMD) based Scheme
• Fu et al., IEEE Trans. on Dependable & Secure Computing, 2006
• the first image-based phishing detection approach
• EMD is used to evaluate the distance between two signatures
• Signature (S)
  – the frequency and the centroid of each color used
• Weight (p, q)
  – a linear combination of the Euclidean color distance and the distance between color centroids
• Visual similarity degree (VSD)
  – VSD = 1 − EMD^α
• pros: simple and fast
• cons: only suitable for basic phishing cases
  – it tends to fail when a phishing page is only partially similar to the official page
  – however, phishing pages are usually partially different from their targets!

Parameter Settings
• CCH settings
  – levels to describe salient points (L) = 4
  – Euclidean distance between two salient points (Dist) = 7 pixels
  – input image size: original webpage resolution (mostly 800 × 600)
  – k-means parameter (k) = 4
  – naïve Bayesian classifier
• EMD settings
  – we follow the settings suggested in Fu et al.'s previous work
  – input image size: 100 × 100 (Lanczos3 resampling algorithm)
  – color degrading factor (CDF): 32
  – amplifier for the EMD value (α): 0.5
  – number of colors used for the signature (|Ss|): 20
  – weight for the color distance (p): 0.5
  – weight for the color centroid distance (q): 0.5
  – a naïve Bayesian classifier is used instead of a per-page threshold

Top 5 Phishing Target Sites
• AUC
  – CCH: 0.998
  – EMD: 0.956
[figure]

Impact of Image Size on Computation Time
[figure]

Conclusions
• We proposed an image-based phishing detection technique with local features.
• Our experimental results show
  – a phishing recognition rate above 96%, and
  – less than 0.30 seconds per phishing identification on average.
• Our experiments show that local features are more suitable than global information for phishing page detection.

THANK YOU!
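
Supplementary sketch for the EMD-based scheme: the slides describe a signature made of the frequency and centroid of each color, and VSD = 1 − EMD^α with α = 0.5. The Python below is a minimal illustration of those two pieces only, not Fu et al.'s implementation. It assumes that the color degrading factor (CDF = 32) means each 8-bit channel is quantized into buckets of width 32, and that the EMD value passed to `vsd` is already normalized to [0, 1]; the function names and the pixel-list input format are hypothetical.

```python
from collections import defaultdict


def degrade(channel, cdf=32):
    """Quantize an 8-bit channel value into buckets of width cdf.

    Assumption: this is what the 'color degrading factor' does.
    """
    return channel // cdf


def signature(pixels, width, cdf=32, max_colors=20):
    """Build a color signature from a row-major list of (r, g, b) pixels.

    Returns up to max_colors entries of (degraded_color, frequency, centroid),
    most frequent color first; centroid is the mean (x, y) pixel position.
    """
    count = defaultdict(int)
    sum_x = defaultdict(int)
    sum_y = defaultdict(int)
    for i, (r, g, b) in enumerate(pixels):
        color = (degrade(r, cdf), degrade(g, cdf), degrade(b, cdf))
        count[color] += 1
        sum_x[color] += i % width   # column index
        sum_y[color] += i // width  # row index
    # Most frequent colors first; ties broken by color value for determinism.
    top = sorted(count, key=lambda c: (-count[c], c))[:max_colors]
    total = len(pixels)
    return [
        (c, count[c] / total, (sum_x[c] / count[c], sum_y[c] / count[c]))
        for c in top
    ]


def vsd(emd, alpha=0.5):
    """Visual similarity degree: VSD = 1 - EMD**alpha (EMD assumed in [0, 1])."""
    if not 0.0 <= emd <= 1.0:
        raise ValueError("EMD is assumed to be normalized to [0, 1]")
    return 1.0 - emd ** alpha


if __name__ == "__main__":
    # A 2x2 toy image: three reddish pixels and one blue pixel.
    toy = [(255, 0, 0), (250, 5, 3), (0, 0, 255), (255, 0, 0)]
    for entry in signature(toy, width=2):
        print(entry)
    print(vsd(0.25))
```

With α = 0.5, small EMD values are amplified (an EMD of 0.25 already lowers the VSD to 0.5), which is why the slides call α an amplifier. Computing the EMD itself requires solving a transportation problem between two signatures and is not shown here.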
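
Supplementary sketch for the AUC comparison (CCH: 0.998, EMD: 0.956): AUC can be read as the probability that a randomly chosen positive comparison (phishing page vs. its target) scores higher than a randomly chosen negative comparison (two unrelated legitimate pages). The Python below computes it directly from that definition. It is an illustration only; the score lists are made-up values, not data from the experiments, and the function name is hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    score is higher; ties count as half. Quadratic in the input sizes, so a
    rank-based formulation would be preferable for large data sets.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    # Hypothetical similarity scores, for illustration only.
    phishing_vs_target = [0.9, 0.8, 0.4]
    legit_vs_legit = [0.5, 0.3]
    print(auc(phishing_vs_target, legit_vs_legit))
```

An AUC of 1.0 means every positive comparison outscores every negative one, and 0.5 corresponds to random ranking.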