Anomaly based Web Phishing Page Detection

Reporter: Jing Chiu
Advisor: Yuh-Jye Lee
Email: D9815013@mail.ntust.edu.tw
2015/4/13, Data Mining & Machine Learning Lab

Paper Information
• Authors
  • Ying Pan, School of Information Systems, Singapore Management University
  • Xuhua Ding, School of Information Systems, Singapore Management University
• Source
  • Annual Computer Security Applications Conference 2006 (ACSAC’06)

Outline
• Introduction
• Related Work
• Analysis of Phishing Pages
• Mechanism
  • Architecture
  • Identity Extractor
  • Page Classifier
  • Feature Vector Generation
• Experiments
  • Experiments of Identity Extractor
  • Experiments of Page Classifier
• Conclusion

Introduction
• A common factor among all phishing sites
  • They maliciously mislead users into believing that they are other, legitimate sites
  • A phishing site maliciously claims a false identity
• Proposed method
  • Use web DOM objects to obtain the web identity
  • Use the web identity to capture phishing site anomalies

Related Work
• Existing anti-phishing schemes
  • Server-based schemes
    • Requiring server authentication to defend against phishing attacks
    • Blacklisting services
  • Browser-based schemes
    • Browsers regulate web pages’ visual behaviors to prevent cheating
    • Blacklist plug-ins in the browser
  • Proactive schemes
    • Detecting phishing pages based on visual similarity
    • Detecting phishing pages by phishing-related activity

Analysis of Phishing Pages
• Web identity: a set of words that uniquely identifies the web site’s ownership in cyberspace
  • An abbreviation of the organization’s full name
  • A unique string appearing in its domain name
• A phishing web site with its own identity A attempts to claim a false identity B
• A list of characteristics of phishing pages
  • Based on a study of about 300 phishing sites from APWG’s repository
  • List I & List II

Mechanism
• Architecture
• Identity Extractor
• Page Classifier
• Feature Vector Generation

Architecture

Identity Extractor
• Extract the identity from DOM objects/properties
  • Title
  • Description
  • Copyright
  • ALT/title
  • Address
  • Body
  • Related DOM objects/properties
• Extract the identity in the following steps
  • Form an identity-relevant object set D
  • Initialize a word set W from D as identity candidates
  • Use the chi-square test to separate identity words from ordinary words
  • Identity Extraction Algorithm (I, II)

Page Classifier
• Support Vector Machine
  • LibSVM
• Feature Vector Generation
  • Given the identity set I
  • 10 features are extracted

Feature Vector Generation
• Feature 1: URL address
  • F1 = 1 if no identity appears in the URL address
  • F1 = 0 if the page uses only an IP address that cannot be resolved into a host name
  • F1 = -1 otherwise
• Feature 2: DNS record
  • F2 = -1 if all identities are substrings of the DNS record R
  • F2 = 0 if no record is returned
  • F2 = 1 otherwise

Feature Vector Generation (cont.)
• Feature 3.1-3.3: URL of anchor
  • F31: Nil anchor (points to nothing)
  • F32: ID anchor (points to another domain that contains the identity)
  • F33: Domain anchor (points to a foreign domain)

Feature Vector Generation (cont.)
• Feature 4: Server form handler
  • F4 = 1 if any void or foreign form handler exists
  • F4 = 0 if there is no form
  • F4 = -1 otherwise
• Feature 5.1-5.2: Request URL
  • F51: ID request URL (points to another domain that contains the identity)
  • F52: Domain request URL (points to a foreign domain)

Feature Vector Generation (cont.)
• Feature 6: Domain in cookie
  • F6 = 1 if any foreign domain exists in the cookies
  • F6 = 0 if there is no domain in the cookies or there are no cookies
  • F6 = -1 otherwise
• Feature 7: Certificate in SSL
  • F7 = 1 if one of the claimed identities does not appear in the certificate, or the URL specified in the certificate is different from L
  • F7 = 0 if SSL is not applied
  • F7 = -1 otherwise

Experiments
• Dataset
  • 279 phishing pages vs. 100 official pages
  • The 279 attacks have only 49 different targets
• Experiments of Identity Extractor
  • Results for three web pages
  • Success rate
• Experiments of Page Classifier
  • Dataset
    • Training set size: 50 positive + 50 negative
    • Testing set size: 50 pages
    • Positive portions: 2%, 6%, 10%, 20%, 30%, 40%, 50%
  • Use FP rate and miss rate (FN rate) as measurements

Exp. of Identity Extractor
• Identity extraction results of three web pages
• Success rate (λ) of the identity extractor
  • N is the total number
  • n is the correct number

Exp. of Page Classifier

Exp. of Page Classifier (cont.)
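
The measurements named in the Experiments slides can be sketched in a few lines of Python. This is an illustration, not the authors' evaluation code: the function names, the 1 = phishing / 0 = legitimate label encoding, and the reading of the success rate as λ = n / N (inferred from the slide's definitions of n and N) are all assumptions.

```python
def fp_and_miss_rate(y_true, y_pred):
    """Return (FP rate, miss rate) for binary labels.

    Assumed encoding: 1 = phishing (positive), 0 = legitimate (negative).
    """
    pairs = list(zip(y_true, y_pred))
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    # Guard against empty classes so the sketch never divides by zero.
    fp_rate = false_pos / negatives if negatives else 0.0
    miss_rate = false_neg / positives if positives else 0.0
    return fp_rate, miss_rate


def success_rate(n_correct, n_total):
    """Assumed form of the identity extractor's success rate: lambda = n / N."""
    return n_correct / n_total
```

With a testing set of 50 pages and a positive portion of 2%, only one page is phishing, so the miss rate can only be 0 or 1 at that setting; the FP rate is the more informative figure there.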

Conclusion
• The benefits
  • Requires no online interaction with a third party
  • Requires no change in users’ navigation behavior
  • Resistant to adaptive phishing attackers
    • Completely evading this scheme imposes a high cost on the attacker

Characteristics of Phishing Pages I
• Disguised keyword/description
  • A phishing page uses the fake identity to pose as a normal site
• Abnormal URL
  • The hostname in the URL, or the one resolved from the IP, does not match the claimed identity
• Abnormal DNS record
  • The DNS record usually contains identity information
• Abnormal anchors
  • The domains of the anchors’ URLs differ from the page’s domain, and these domains contain the claimed identity
  • Anchors do not link to any page

Characteristics of Phishing Pages II
• Abnormal server form handler
  • The form has no action, or the action is handled by a server in a different domain
• Abnormal request URL
  • A phishing site usually has objects that reference the real site
• Abnormal cookie
  • A phishing site’s cookies point either to its own domain (inconsistent with the claimed identity) or to the real site (inconsistent with its own domain)
• Abnormal certificate in SSL
  • The Distinguished Names in the certificates are inconsistent with the claimed identities

Identity Extraction Algorithm
• Input: web page P; Output: identity set I
• Construction of object set D
  • From the related DOM objects/properties
• Construction of word set W
  • Tokenization by stop marks, stop-word removal, and stemming
  • Remove all stop-word objects d from D
• Calculation of the occurrences C_{w,d}
• Supplement of body object
• Calculation of term frequency

Identity Extraction Algorithm (cont.)
• Calculation of expected probability
• Calculation of χ² value
• Output an identity set with the largest χ² value

Related DOM objects/properties
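
As an illustration of how one feature could be read off the DOM, the Feature 4 rule (server form handler) can be sketched with Python's standard library. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `FormActionCollector`, `VOID_ACTIONS`, and `feature_f4` are hypothetical, "void" is assumed to mean a missing, empty, `#`, `about:blank`, or `javascript:` action, and "foreign" is assumed to mean that the handler's hostname differs from the page's hostname.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FormActionCollector(HTMLParser):
    """Collects the raw action attribute of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing or valueless action attribute is stored as None.
            self.actions.append(dict(attrs).get("action"))


# Assumption: these action values count as "void" handlers.
VOID_ACTIONS = {"", "#", "about:blank"}


def feature_f4(page_url, html):
    """Return 1, 0, or -1 following the Feature 4 rule on the slide."""
    collector = FormActionCollector()
    collector.feed(html)
    if not collector.actions:
        return 0  # no form on the page
    page_host = urlparse(page_url).hostname
    for action in collector.actions:
        if action is None:
            return 1  # void handler: no action given
        value = action.strip().lower()
        if value in VOID_ACTIONS or value.startswith("javascript:"):
            return 1  # void handler
        # Resolve relative actions against the page URL before comparing.
        target_host = urlparse(urljoin(page_url, action)).hostname
        if target_host != page_host:
            return 1  # foreign handler (assumed: hostname mismatch)
    return -1  # every form is handled by the page's own host
```

Comparing full hostnames is stricter than comparing registered domains, so a form on `www.example.com` that posts to `login.example.com` would count as foreign here; a domain-level comparison would need a public-suffix list, which this sketch leaves out.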
