Reporter: Jing Chiu
Advisor: Yuh-Jye Lee
Email: D9815013@mail.ntust.edu.tw
2015/4/13, Data Mining & Machine Learning Lab

Paper Information
- Authors
  - Ying Pan, School of Information Systems, Singapore Management University
  - Xuhua Ding, School of Information Systems, Singapore Management University
- Source
  - Annual Computer Security Applications Conference 2006 (ACSAC’06)

Outline
- Introduction
- Related Work
- Analysis of Phishing Pages
- Mechanism
  - Architecture
  - Identity Extractor
  - Page Classifier
  - Feature Vector Generation
- Experiments
  - Experiments of Identity Extractor
  - Experiments of Page Classifier
- Conclusion

Introduction
- A common factor among all phishing sites
  - They maliciously mislead users into believing that they are other, legitimate sites
  - A phishing site maliciously claims a false identity
- Proposed Method
  - Use web DOM objects to obtain the web identity
  - Use the web identity to capture phishing site anomalies

Related Work
- Existing anti-phishing schemes
  - Server-based schemes
    - Require server authentication to defend against phishing attacks
    - Blacklisting services
  - Browser-based schemes
    - The browser regulates web pages’ visual behaviors to prevent cheating
    - Blacklist plug-in in the browser
  - Proactive schemes
    - Detect phishing pages based on visual similarity
    - Detect phishing pages by phishing-related activity

Analysis of Phishing Pages
- Web identity: a set of words that uniquely identifies the web site’s ownership in cyberspace
  - An abbreviation of the organization’s full name
  - A unique string appearing in its domain name
- A phishing web site with its own identity A attempts to claim a false identity B
- A list of characteristics of phishing pages
  - Based on a study of about 300 phishing sites from APWG’s repository
  - List I & List II

Mechanism
- Architecture
- Identity Extractor
- Page Classifier
- Feature Vector Generation

Architecture

Identity Extractor
- Extract the identity from DOM objects/properties
  - Title
  - Description
  - Copyright
  - ALT/title
  - Address
  - Body
  - Related DOM objects/properties
- Extract the identity by the following steps
  - Form an identity-relevant object set D
  - Initialize a word set W from D as identity candidates
  - Use the chi-square test to separate identity words from ordinary words
  - Identity Extraction Algorithm (I, II)

Page Classifier
- Support Vector Machine
  - LibSVM
- Feature Vector Generation
  - Given the identity set I
  - 10 features are extracted

Feature Vector Generation
- Feature 1: URL address
  - F1 = 1 if no identity appears in the URL address
  - F1 = 0 if the page uses only an IP address and it cannot be resolved into a host name
  - F1 = -1 otherwise
- Feature 2: DNS record
  - F2 = -1 if all identities are substrings of the DNS record R
  - F2 = 0 if no record is returned
  - F2 = 1 otherwise

Feature Vector Generation (cont.)
- Feature 3.1-3.3: URL of anchor
  - F31: Nil anchor (points to nothing)
  - F32: ID anchor (points to another domain that contains the identity)
  - F33: Domain anchor (points to a foreign domain)

Feature Vector Generation (cont.)
- Feature 4: Server form handler
  - F4 = 1 if any void or foreign form handler exists
  - F4 = 0 if there is no form
  - F4 = -1 otherwise
- Feature 5.1-5.2: Request URL
  - F51: ID request URL (points to another domain that contains the identity)
  - F52: Domain request URL (points to a foreign domain)

Feature Vector Generation (cont.)
- Feature 6: Domain in cookie
  - F6 = 1 if any foreign domain exists in the cookies
  - F6 = 0 if there is no domain in the cookies or there are no cookies
  - F6 = -1 otherwise
- Feature 7: Certificate in SSL
  - F7 = 1 if one of the claimed identities does not appear in the certificate, or the URL specified in the certificate is different from L
  - F7 = 0 if SSL is not applied
  - F7 = -1 otherwise

Experiments
- Dataset
  - 279 phishing pages vs. 100 official pages
  - The 279 attacks have only 49 different targets
- Experiments of Identity Extractor
  - Results for three web pages
  - Success rate
- Experiments of Page Classifier
  - Dataset
    - Training set size: 50 positive + 50 negative
    - Testing set size: 50 pages
    - Positive portions: 2%, 6%, 10%, 20%, 30%, 40%, 50%
  - Use FP rate and miss rate (FN rate) as measurements

Exp. of Identity Extractor
- Identity Extraction Results of Three Web Pages
- Success Rate (λ) of the Identity Extractor
  - N is the total number
  - n is the correct number

Exp. of Page Classifier

Exp. of Page Classifier (cont.)
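The FP rate and miss rate (FN rate) used as measurements in the Page Classifier experiments follow the usual confusion-matrix definitions. A minimal sketch, assuming labels of 1 for phishing (positive) and -1 for legitimate (negative); the function name `fp_and_miss_rate` and the sample labels are illustrative only and are not taken from the paper:

```python
def fp_and_miss_rate(y_true, y_pred, positive=1):
    """Return (FP rate, miss rate) for binary labels.

    FP rate   = legitimate pages flagged as phishing / all legitimate pages
    miss rate = phishing pages not flagged / all phishing pages (FN rate)
    """
    fp = fn = negatives = positives = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            positives += 1
            if pred != positive:
                fn += 1  # a phishing page that was missed
        else:
            negatives += 1
            if pred == positive:
                fp += 1  # a legitimate page that was flagged
    # Guard against an empty class so the function never divides by zero.
    fp_rate = fp / negatives if negatives else 0.0
    miss_rate = fn / positives if positives else 0.0
    return fp_rate, miss_rate


# Illustrative labels only (not experimental data): 1 = phishing, -1 = legitimate.
y_true = [1, 1, -1, -1, -1, -1]
y_pred = [1, -1, -1, 1, -1, -1]
fp_rate, miss_rate = fp_and_miss_rate(y_true, y_pred)
```

Reporting the two rates separately matters here because the positive portion of the testing set varies from 2% to 50%, and a single accuracy figure would be dominated by the majority class.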
Conclusion
- The benefits
  - Does not require online interactions with a third party
  - Does not require users to change their navigation behavior
  - Resistant to adaptive phishing attackers
    - Complete evasion of this scheme imposes a high cost on the attacker

Characteristics of Phishing Pages I
- Disguised Keyword/Description
  - A phishing page uses the fake identity to pose as a normal site
- Abnormal URL
  - The hostname in the URL, or the one resolved from the IP, does not match the claimed identity
- Abnormal DNS record
  - The DNS record usually contains identity information
- Abnormal Anchors
  - The domains of the anchors’ URLs differ from the page’s domain, and these domains contain the claimed identity
  - Anchors do not link to any page

Characteristics of Phishing Pages II
- Abnormal Server Form Handler
  - The form has no action, or the action is handled by a server in a different domain
- Abnormal request URL
  - A phishing site usually has objects that reference the real site
- Abnormal cookie
  - A phishing site’s cookies point either to its own domain (inconsistent with the claimed identity) or to the real site (inconsistent with its own domain)
- Abnormal certificate in SSL
  - The Distinguished Names in the certificates are inconsistent with the claimed identities

Identity Extraction Algorithm
- Input: Web page P; Output: Identity set I
- Construction of object set D
  - From the related DOM objects/properties
- Construction of word set W
  - Tokenization by stop marks, stop-word removal, and stemming
  - Remove from D every object d that consists entirely of stop words
- Calculation of the occurrences C_{w,d}
  - Supplement of body object
- Calculation of term frequency

Identity Extraction Algorithm (cont.)
- Calculation of expected probability
- Calculation of χ² value
- Output the identity set with the largest χ² value

Related DOM objects/properties
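The steps of the Identity Extraction Algorithm (occurrence counts per object, expected probability, χ² value, selection of the largest) can be sketched as follows. The slides do not preserve the paper's formulas, so this sketch substitutes the standard Pearson chi-square statistic as a stand-in and should not be read as the authors' exact computation. The names `chi_square_scores` and `extract_identity`, the parameter `k`, and the sample words are hypothetical; tokenization, stop-word removal, and stemming are assumed to have been done already.

```python
from collections import Counter


def chi_square_scores(objects):
    """Score each word by how unevenly it is spread over the DOM objects.

    objects: dict mapping an object name (e.g. "title") to its list of words.
    Assumption: the expected count of word w in object d is
    (total count of w) * (share of all words that fall in d).
    """
    counts = {d: Counter(words) for d, words in objects.items()}  # C[d][w]
    object_totals = {d: sum(c.values()) for d, c in counts.items()}
    grand_total = sum(object_totals.values())

    word_totals = Counter()
    for c in counts.values():
        word_totals.update(c)

    scores = {}
    for w, w_total in word_totals.items():
        chi2 = 0.0
        for d, c in counts.items():
            expected = w_total * object_totals[d] / grand_total
            if expected > 0:
                chi2 += (c[w] - expected) ** 2 / expected
        scores[w] = chi2
    return scores


def extract_identity(objects, k=1):
    """Return the k words with the largest chi-square score (k is hypothetical)."""
    scores = chi_square_scores(objects)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up example page, not data from the paper.
page_objects = {
    "title": ["paypal", "login"],
    "copyright": ["paypal", "account"],
    "body": ["login", "account", "login", "account", "password", "secure"],
}
identity = extract_identity(page_objects, k=1)
```

In this made-up example the word "paypal" occurs only in the title and copyright objects and never in the body, so its distribution departs most from the expected one and it ranks first. Words that occur only once in a small object can also receive high scores under this statistic, which is one reason a real implementation would need the filtering steps listed on the slides.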