PhishNet: Predictive Blacklisting to Detect Phishing Attacks
Reporter: Gia-Nan Gao
Advisor: Chin-Laung Lei
2010/4/26

Reference
Pawan Prakash, Manish Kumar, Ramana Rao Kompella, and Minaxi Gupta, “PhishNet: Predictive Blacklisting to Detect Phishing Attacks,” in IEEE INFOCOM 2010.

Outline
Introduction
Two Major Components of PhishNet
  ◦ URL prediction component
  ◦ Approximate URL matching component
Evaluation
Conclusion

Introduction
Phishing attacks
  ◦ Set up fake web sites mimicking real businesses in order to lure innocent users into revealing sensitive information
Blacklisting
  ◦ Match a given URL against a list of URLs belonging to a blacklist
Problem of blacklisting
  ◦ Malicious URLs cannot be known before they reach a certain amount of prevalence in the wild

Two Major Components of PhishNet
URL prediction component
  ◦ Generate new URLs (children) from known phishing URLs (parents) by employing various heuristics
  ◦ Test whether the newly generated URLs are indeed malicious
Approximate URL matching component
  ◦ Perform an approximate match of a new URL against the existing blacklist

Component 1: Heuristics for Generating New URLs
Typical blacklist URL structure
  ◦ http://domain.TLD/directory/filename?query string
H1: Replacing TLDs
H2: IP address equivalence
H3: Directory structure similarity
H4: Query string substitution
H5: Brand name equivalence

Heuristics for Generating New URLs
H1: Replacing TLDs
  ◦ 3,210 effective top-level domains (TLDs)
  ◦ Replace the effective TLD of the parent URL with the 3,209 other effective TLDs
H2: IP address equivalence
  ◦ Phishing URLs having the same IP address are grouped together into clusters
  ◦ Create new URLs by considering all combinations of hostnames and pathnames

Heuristics for Generating New URLs (cont’d)
H3: Directory structure similarity
  ◦ URLs with similar directory structure are grouped together
  ◦ Build new URLs by exchanging the filenames among URLs belonging to the same group
  ◦ Parent
      www.abc.com/online/signin/paypal.htm
      www.xyz.com/online/signin/ebay.htm
  ◦ Child
      www.abc.com/online/signin/ebay.htm
      www.xyz.com/online/signin/paypal.htm

Heuristics for Generating New URLs (cont’d)
H4: Query string substitution
  ◦ Build new URLs by exchanging the query strings among URLs
  ◦ Parent
      www.abc.com/online/signin/ebay?XYZ
      www.xyz.com/online/signin/paypal?ABC
  ◦ Child
      www.abc.com/online/signin/ebay?ABC
      www.xyz.com/online/signin/paypal?XYZ

Heuristics for Generating New URLs (cont’d)
H5: Brand name equivalence
  ◦ Build new URLs by substituting brand names occurring in phishing URLs with other brand names

Component 1: Verification
Conduct a DNS lookup to filter out sites that cannot be resolved
For each of the resolved URLs
  ◦ Try to establish a connection to the corresponding server
For each successful connection
  ◦ Initiate an HTTP GET request to obtain content from the server
If the HTTP header from the server has status code 200/202 (successful request)
  ◦ Perform a content similarity check between the parent and the child URLs
If the child URL’s content closely resembles (above, say, 90%) that of the parent URL
  ◦ Conclude that the child URL is a bad site

Component 2: Approximate Matching
Determine whether a given URL is a phishing site or not

M1: Matching IP Address
Perform a direct match of the IP address of the URL with the IP addresses of the blacklist entries
Assign a normalized score based on the number of blacklist entries that map to a given IP address
If IP address IPi is common to ni URLs
min{ni} (max{ni}): the minimum (maximum) of the number of phishing URLs hosted by blacklisted entries of IP addresses

M2: Matching Hostname
Perform a hostname match with those in the blacklist
Domains of phishing URLs
  ◦ Specifically registered for hosting phishing sites
  ◦ Hosted on free/paid-for web-hosting services (WHS)
Identify whether an incoming URL contains a WHS or not
  ◦ Matching WHSes
  ◦ Matching non-WHSes

M2: Matching Hostname (cont’d)

M3: Matching Directory Structure
Perform a directory structure match with those in the blacklist
Philosophy of this design
  ◦ H3 (directory structure similarity)
  ◦ H4 (query string substitution)
ni: the number of URLs corresponding to a directory structure

M4: Matching Brand Names
Check for the existence of brand names in the pathname and query string of URLs
ni: the number of occurrences of the brand name
Compute a final cumulative score
  ◦ Assign different weights to different modules

Evaluation: Component 1
Collect 6,000 URLs from PhishTank (2009/7/2 ~ 2009/7/25)

Evaluation: Component 2
How many benign (malicious) sites are (not) flagged as malicious
Data source
  ◦ Phishing URLs
      PhishTank (about 18,000 URLs)
      SpamScatter (14,000 URLs)
  ◦ Benign URLs
      DMOZ (100,000 benign URLs)
      20,000 benign URLs from the Yahoo Random URL Generator (YRUG)

Evaluation: Component 2 (cont’d)
Training phase
  ◦ Create various data structures using the phishing URLs
Testing phase
  ◦ An input URL is flagged as a phishing or a benign site
Weight of individual modules
  ◦ W(M1, M2, M3, M4) = (1.0, 1.0, 1.5, 1.5)

Evaluation: Component 2 (cont’d)

Conclusion
Address major problems associated with blacklists
Two major components of PhishNet
  ◦ URL prediction component
  ◦ Approximate URL matching component
Flag new URLs effectively
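
Sketch: H3 and H4 child generation
The H3 and H4 heuristics from Component 1 can be illustrated with a short Python sketch. It reproduces the parent/child examples given on those slides. The function names (split_url, exchange_filenames, exchange_queries) are hypothetical, and grouping by an exactly identical directory path is an assumption; the slides only say "similar directory structure". For H4, grouping by directory is likewise an assumption based on the slide's example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def split_url(url):
    """Split a scheme-less URL into (host, directory, filename, query)."""
    # Prefixing "//" makes urlsplit treat the first component as the host.
    parts = urlsplit("//" + url)
    directory, _, filename = parts.path.rpartition("/")
    return parts.netloc, directory, filename, parts.query


def group_by_directory(parents):
    """Group parent URLs whose directory paths are identical (assumption)."""
    groups = defaultdict(list)
    for url in parents:
        host, directory, filename, query = split_url(url)
        groups[directory].append((host, filename, query))
    return groups


def exchange_filenames(parents):
    """H3 sketch: swap filenames among parents in the same directory group.

    Query strings are ignored here; this sketch only covers the slide's case.
    """
    children = set()
    for directory, members in group_by_directory(parents).items():
        filenames = {filename for _, filename, _ in members}
        for host, _, _ in members:
            for filename in filenames:
                children.add(f"{host}{directory}/{filename}")
    return children - set(parents)


def exchange_queries(parents):
    """H4 sketch: swap query strings among parents in the same group."""
    children = set()
    for directory, members in group_by_directory(parents).items():
        queries = {query for _, _, query in members if query}
        for host, filename, _ in members:
            for query in queries:
                children.add(f"{host}{directory}/{filename}?{query}")
    return children - set(parents)
```

Each function returns only the newly generated URLs; in PhishNet these candidates would still have to pass the verification steps before being treated as malicious.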
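
Sketch: verification decision for a child URL
The last two verification steps of Component 1 (status code check, then content similarity against the parent) can be sketched as follows. The slides do not name the similarity measure, so difflib.SequenceMatcher is used here purely as a stand-in; the function name is_phishing_child is hypothetical, and the 0.9 threshold follows the "above, say, 90%" wording. DNS resolution and the HTTP request itself are left out, so the function takes the already fetched status code and page contents.

```python
from difflib import SequenceMatcher

# Status codes the slides treat as a successful request.
ACCEPTED_STATUS = {200, 202}


def is_phishing_child(status, parent_content, child_content, threshold=0.9):
    """Return True if the child page looks like a copy of the parent page.

    SequenceMatcher.ratio() is a stand-in similarity measure (assumption);
    it returns a value in [0, 1].
    """
    if status not in ACCEPTED_STATUS:
        return False
    ratio = SequenceMatcher(None, parent_content, child_content).ratio()
    return ratio > threshold
```

Keeping the network steps outside this function makes the decision rule easy to test in isolation; a real crawler would feed it the response status and body.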
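
Sketch: normalized module scores and the cumulative score
The scoring in Component 2 can be sketched as follows. The score formulas are not preserved in the slide text, so the min-max normalization below is an assumption suggested by the min{ni} and max{ni} definitions on the M1 slide, and combining modules with a weighted sum is an assumption suggested by "assign different weights to different modules". The weights are the ones listed in the evaluation; the function names are hypothetical.

```python
# Module weights from the evaluation slide: W(M1, M2, M3, M4).
WEIGHTS = {"M1": 1.0, "M2": 1.0, "M3": 1.5, "M4": 1.5}


def normalized_score(n, counts):
    """Min-max normalize a count n against all observed counts (assumption).

    counts holds n_i for every blacklist key, e.g. URLs per IP address.
    """
    lowest, highest = min(counts), max(counts)
    if highest == lowest:
        # All keys have the same count; treat a match as a full score.
        return 1.0
    return (n - lowest) / (highest - lowest)


def cumulative_score(module_scores, weights=WEIGHTS):
    """Combine per-module scores with a weighted sum (assumption)."""
    return sum(weights[name] * score for name, score in module_scores.items())
```

For example, an IP address hosting 5 blacklisted URLs, where the observed counts range from 1 to 9, would get a normalized M1 score of 0.5 under this assumption.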