Detecting Fake Websites: The Contribution of Statistical Learning Theory

Abbasi, Zhang, Zimbra, Chen, Nunamaker
MISQ, 34(3), 2010
MISQ Best Paper Award, 2011
Introduction
• The increased popularity of the Internet has attracted opportunists.
– Seeking to capitalize on the asymmetric nature of online
information exchange.
– Consequently many forms of fake and deceptive websites
have appeared (Chua & Wareham, 2004):
– Web Spam
• Sites attempting to deceive search engines to boost their ranking
(Gyöngyi and Garcia-Molina, 2005).
• Objective: search engine optimization (SEO)
• Often done for profit (site for sale)
• Typically do not attempt to defraud Internet users
– Concocted Sites
• Fraudulent sites attempting to appear as legitimate commercial
service providers.
• Objective: failure-to-ship fraud (Chua and Wareham, 2004)
• E.g., fake escrow, financial, and delivery company sites (Abbasi
and Chen, 2007).
– Spoof Sites
• Replicas of real commercial sites intended to deceive the
authentic sites’ customers (Chou et al., 2004).
• Objective: identity theft; capture one's account information
Introduction
• We focus on spoof and concocted
websites
– Since they’re used to defraud end users
• Concocted sites
– Becoming increasingly common, with over one
hundred new entries added daily to online databases
such as Artists Against 4-1-9.
• Spoof sites
– According to a 2004 survey, 70% of respondents had
visited spoof sites and 15% admitted to providing
personal data to spoofs (Wu et al., 2006).
Introduction
• Fake websites often look highly professional and are
difficult to identify as phony (MacInnes et al., 2005).
• In response to increasing Internet user awareness,
fraudsters are also becoming more sophisticated (Levy and
Arce, 2004).
– As a result, there is a need for enhanced fake website detection
techniques (Chou et al., 2004).
• Such methods are important to decrease Internet fraud
stemming from phony websites.
Introduction
• Numerous tools have been proposed; however, they have several
shortcomings:
– Most are lookup systems that rely solely on manually crafted blacklists of
fake URLs
• Lists generated from user reports, making them reactive.
– Few systems using proactive classification techniques have been
proposed
• Those that have been proposed use overly simplistic features and classification heuristics
– Most systems are geared towards spoof sites
• It is unclear how effective they would be at detecting generated fraud sites
• We propose a statistical learning theory (SLT) based system for
detecting fake websites
– Capable of detecting generated fraud and spoof sites
– Uses a rich feature set and composite SVM kernel for enhanced fake
website detection capabilities
– Can be combined with a lookup mechanism for hybridized detection
using a dynamic classifier
Fake Website Detection Tools
• Several tools have been developed for identifying and
protecting against fake websites.
• Fake website detection tools belong to two
categories:
– Lookup Systems
– Classifier Systems
Fake Website Detection Tools
• Lookup Systems
– Description
• Use a client-server architecture (Li and Helenius, 2007)
• Server side maintains blacklist of known fake site URLs (Zhang et al., 2007)
• Rely on collaborative sanctioning mechanisms similar to reputation ranking (Hariharan
et al., 2007)
• Examples are IE7 Phishing Filter, FirePhish, Sitehound, and the EarthLink Toolbar
– Advantages
• High precision: less likely to report false positives, i.e., considering an authentic site
fake (Zhang et al., 2007)
– Since all URLs in database are verified by the online sources they are taken from.
• Computationally faster than classifier systems
• Easier to implement
– Disadvantages
• Lower recall: more likely to report false negatives (i.e., overlooking fake websites)
– Since database is limited to small number of online resources, may lack coverage.
• Lookup systems are reactive by nature; depending on users to report URLs (Liu et al.,
2006)
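The lookup mechanism described above can be illustrated with a minimal sketch. It assumes the blacklist is a set of host names; the entries, the `BLACKLIST` name, and the helper functions are hypothetical and are not taken from any of the tools listed.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entries; a real tool would obtain these from its server.
BLACKLIST = {"fake-escrow.example", "paypa1-login.example"}


def host_of(url: str) -> str:
    """Return the lowercased host name of a URL ('' if none can be parsed)."""
    return (urlsplit(url).hostname or "").lower()


def is_blacklisted(url: str, blacklist=BLACKLIST) -> bool:
    """Lookup-style check: flag a URL only if its host has already been reported."""
    return host_of(url) in blacklist
```

A fake site that nobody has reported yet is not in the set, so the check returns False for it. This is the reactive, lower-recall behavior listed under the disadvantages.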
Fake Website Detection Tools
• Classifier Systems
– Description
• Use rule based heuristics or similarity scores
• Applied to website content or domain registration information (Wu et al., 2006;
Zhang et al., 2007)
• Classifier systems run on the client side
• Examples are SpoofGuard, Netcraft, and eBay Account Guard
– Advantages
• Can provide better coverage (i.e., recall) for spoof and generated fake sites than
lookup systems
– Depending on the classification heuristics, rules, and/or models used.
• Classifier systems are proactive
– Disadvantages
• Classifiers can be more computationally expensive, taking longer to classify web
pages than lookup systems
• More prone to false positives
• Generalization ability of classification models over time can be an issue
– Especially if the fake websites are constantly changing and evolving
– In such situations the classification model must also adapt and relearn
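The precision and recall trade-off between lookup and classifier systems can be made concrete with a short sketch. It treats "fake" as the positive class. The function name and the label strings are illustrative assumptions, not part of any tool discussed here.

```python
def detection_metrics(actual, predicted):
    """Accuracy, precision, and recall with 'fake' as the positive class.

    actual, predicted: equal-length sequences of "fake" / "real" labels.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a == "fake" and p == "fake" for a, p in pairs)  # fakes caught
    fp = sum(a == "real" and p == "fake" for a, p in pairs)  # real sites flagged
    fn = sum(a == "fake" and p == "real" for a, p in pairs)  # fakes missed
    tn = sum(a == "real" and p == "real" for a, p in pairs)  # real sites passed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

With hypothetical labels for four fake and four real sites, a system that flags only two of the fakes and no real sites has precision 1.0 and recall 0.5. A system that flags all four fakes plus one real site has precision 0.8 and recall 1.0.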
Summary of Fake Website Detection Tools
| Tool Name | System Type: Classifier | System Type: Lookup | Website Type | Prior Results |
|---|---|---|---|---|
| Cloudmark | None | Server-side blacklist | Spoof sites | Accuracy: 83.9%; Spoof Detection: 45.0% |
| EarthLink Toolbar | None | Server-side blacklist | Spoof sites | Accuracy: 90.5%; Spoof Detection: 68.5% |
| eBay Account Guard | Text and image content similarity to eBay and PayPal websites | Server-side blacklist | Spoof sites (primarily of eBay and PayPal) | Accuracy: 83.2%; Spoof Detection: 40.0% |
| FirePhish | None | Server-side blacklist | Spoof sites | Accuracy: 89.2%; Spoof Detection: 61.5% |
| IE7 Phishing Filter | None | Client-side whitelist, server-side blacklist | Spoof sites | Accuracy: 92.0%; Spoof Detection: 71.5% |
| Netcraft | Domain registration information | Server-side blacklist | Generated sites, spoof sites | Accuracy: 91.2%; Spoof Detection: 68.5% |
| SiteWatcher | Text and image feature similarity, stylistic feature correlation | Client-side whitelist | Spoof sites | N/A |
| Sitehound | None | Server-side blacklist downloaded by client | Generated sites, spoof sites | N/A |
| SpoofGuard | Image hashes, password encryption, URL similarities, domain registration information | None | Generated sites, spoof sites | Accuracy: 67.7%; Spoof Detection: 93.5% |
| GeoTrust TrustWatch | None | Server-side blacklist | Spoof sites | Accuracy: 85.1%; Spoof Detection: 46.5% |
Summary of Fake Website Detection Tools
• Existing systems’ performance is inadequate due to
insufficient use of “fraud cues”
– Could be useful since fake websites are often “templatic”
– Fraudsters automatically mass-produce fake websites
• There has been no prior evaluation on concocted sites
• There’s been limited use of classifiers evaluating page
content
• Limited utilization of hybrid systems that combine
classifiers with a lookup mechanism.
Fraud Cues in Fake Website Templates
• Body text
• Web page source code
• URLs
• Images
• Linkage information
Fake Website Detection using SLT-based Methods
• In summary, effective fake website detection
systems must:
– Generalize across diverse collections of concocted
and spoof websites.
– Incorporate rich sets of fraud cues.
– Leverage important domain-specific knowledge:
stylistic similarities and content duplication.
– Provide long term sustainability against dynamic
adversaries by adapting to changes.
Fake Website Detection using SLT-based Methods
• SLT also provides a mechanism for addressing the four important
characteristics necessary for effective fake website detection
systems.
– Ability to generalize
• The “maximum margin” principle and corresponding optimization techniques
employed by SLT-based classifiers set out to minimize classification error
while simultaneously maximizing their generalization capabilities
– Rich fraud cues
• Since SLT-based classifiers transform input data into a kernel matrix, they
are able to utilize sizable input feature spaces
– Utilization of domain knowledge
• By supporting the use of custom kernels, SLT-based classifiers are able to
incorporate unique problem nuances and intricacies, while preserving the
semantic structure of the input data space
– Dynamic learning
• As with other learning-based classifiers, SLT-based classifiers can also
update their models by relearning on newer, more up-to-date training
collections of real and fake websites
Research Hypotheses
• Since classifier systems can better generalize than lookup systems:
– H1: Any non-trivial classifier system, rule or learning-based, will
outperform systems relying exclusively on a lookup mechanism.
• Since SLT-based classifiers can incorporate large sets of fraud
cues:
– H2: SLT-based website classifiers will outperform rule-based classifiers.
• Since SLT-based classifiers can incorporate domain knowledge via
custom kernels:
– H3: SLT-based learning classifiers will outperform other machine
learning algorithms.
• SLT-based classifiers, equipped with custom, problem-specific
kernel functions, can better preserve important fraud cue relations:
– H4: SLT-based classifiers using well-designed kernels will outperform
ones using generic kernel functions
AZProtect System Overview
• Developed an SLT-based fake website detection system
– Uses rich feature set and SVM kernel based machine learning
classifier.
– Capable of classifying concocted and spoof sites.
– Evaluates multiple web pages from a potential site for improved
performance
• Prior systems only evaluated a single URL
– Feature set utilizes over 5,000 features from 5 information types:
• Body text, HTML design, Images, Linkage, and URLs.
• Features extracted and classifier built on 1,000 training websites
collected 6 months before the testing websites.
– Independent of test bed (no overlap).
– Support Vector Machine classifier
• Uses a linear composite kernel
• Tailored towards representing the content similarity and duplication
tendencies of fake websites.
AZProtect System Overview
• Linear composite kernel compares pages' feature vectors against training site pages
• Considers average and maximum similarity for pattern and duplication detection
• Also incorporates page linkage and structure information in each comparison
• Considers a website fake if more than n% of its pages are classified as fake

Represent each page a with the vectors:

$$x_a = \{Sim_{ave}(a, b_1), \ldots, Sim_{ave}(a, b_p)\}; \quad y_a = \{Sim_{max}(a, b_1), \ldots, Sim_{max}(a, b_p)\}$$

Where:

$$Sim(a, k) = \left(1 - \frac{|lv_a - lv_k|}{lv_a + lv_k}\right) \left(1 - \frac{|in_a - in_k|}{in_a + in_k}\right) \left(1 - \frac{|out_a - out_k|}{out_a + out_k}\right) \left(1 - \frac{1}{n} \sum_{i=1}^{n} \frac{|a_i - k_i|}{a_i + k_i}\right)$$

$$Sim_{ave}(a, b) = \frac{1}{m} \sum_{k=1}^{m} Sim(a, k)$$

$$Sim_{max}(a, b) = \max_{k \in \text{pages in site } b} Sim(a, k)$$

For:
$b \in p$ web sites in the training set; $k \in m$ pages in site $b$; $a_1, \ldots, a_n$ and $k_1, \ldots, k_n$ are page $a$ and $k$'s feature vectors;
$lv_a$, $in_a$, and $out_a$ are the page level and number of in/out links for page $a$.

The similarity between two pages is defined as the inner product between their two vectors $x_1, x_2$ and $y_1, y_2$:

Linear composite kernel:

$$K(x_1 + y_1, x_2 + y_2) = \frac{\langle x_1, x_2 \rangle}{\sqrt{\langle x_1, x_1 \rangle \langle x_2, x_2 \rangle}} + \frac{\langle y_1, y_2 \rangle}{\sqrt{\langle y_1, y_1 \rangle \langle y_2, y_2 \rangle}}$$
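The page similarity scores, the composite kernel, and the site-level n% rule on this slide can be sketched in code. This is a sketch under stated assumptions, not the AZProtect implementation. It assumes absolute differences in each ratio, non-negative feature values, a ratio of 0 when both values are 0, and cosine-style normalization of the inner products. The `Page` class and all function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    level: float           # lv: page level within its site
    in_links: float        # in: number of in-links
    out_links: float       # out: number of out-links
    features: List[float]  # a_1..a_n, assumed non-negative


def _rel_diff(u: float, v: float) -> float:
    """|u - v| / (u + v); taken to be 0 when both values are 0 (an assumption)."""
    return abs(u - v) / (u + v) if (u + v) else 0.0


def sim(a: Page, k: Page) -> float:
    """Page-page similarity: product of structure terms and a feature term."""
    pairs = zip(a.features, k.features)
    feat = sum(_rel_diff(x, y) for x, y in pairs) / len(a.features)
    return ((1 - _rel_diff(a.level, k.level))
            * (1 - _rel_diff(a.in_links, k.in_links))
            * (1 - _rel_diff(a.out_links, k.out_links))
            * (1 - feat))


def sim_ave(a: Page, site: List[Page]) -> float:
    """Average similarity between page a and the pages of one training site."""
    return sum(sim(a, k) for k in site) / len(site)


def sim_max(a: Page, site: List[Page]) -> float:
    """Maximum similarity between page a and the pages of one training site."""
    return max(sim(a, k) for k in site)


def page_vectors(a: Page, training_sites: List[List[Page]]):
    """Build x_a (average similarities) and y_a (maximum similarities)."""
    x = [sim_ave(a, b) for b in training_sites]
    y = [sim_max(a, b) for b in training_sites]
    return x, y


def _normalized_dot(u, v) -> float:
    """<u, v> / sqrt(<u, u><v, v>); 0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u) * sum(q * q for q in v))
    return dot / norm if norm else 0.0


def composite_kernel(x1, y1, x2, y2) -> float:
    """Sum of the normalized inner products of the x and y vectors."""
    return _normalized_dot(x1, x2) + _normalized_dot(y1, y2)


def site_is_fake(page_is_fake: List[bool], n_percent: float) -> bool:
    """Site-level rule: fake if more than n% of its pages were classified fake."""
    return 100.0 * sum(page_is_fake) / len(page_is_fake) > n_percent
```

In this sketch, a page compared with itself has similarity 1, and the kernel of a page's vectors with themselves is 2, since each normalized term contributes 1.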
AZProtect System Overview
Illustration of Page-Page and Page-Site Similarity Scores used in the Linear
Composite Kernel Function
AZProtect System Overview
Kernel Illustration: Comparing Two Web Pages against Legitimate
and Fake Websites
Evaluation Test Bed
• We evaluated 350 concocted websites and 350
spoof sites over a 6-week period.
• Taken from 4 online databases (Liu et al., 2006; Zhang
et al., 2007):
– Concocted Sites
• Artists Against 4-1-9
– http://wiki.aa419.org
• Escrow Fraud Online
– http://escrow-fraud.com
– Spoof Sites
• PhishTank
– http://www.phishtank.com
• Anti-Phishing Working Group (APWG)
– http://www.antiphishing.org
• Also evaluated 200 legitimate sites.
– Comprised of websites commonly spoofed or those relevant to
concocted websites
• Resulted in a 900-website test bed
Comparison of Classifier and Lookup Systems
• AZProtect had the best overall performance and fake
website detection accuracy on both test beds.
– Netcraft and SpoofGuard also had decent performance on both
data sets
– Sitehound performed poorly on both test beds, with the worst
performance on each
– FirePhish, IE7, and SpoofGuard also fared well on the spoof site
test bed, but not on concocted sites
H1 and H2 Results
• Conducted pair-wise t-tests on overall accuracy, concocted, and
spoof detection rates.
• H1: Classifier vs. Lookup Systems
– Compared the performance of the four classifier systems against the
four lookup-based tools.
– AZProtect and Netcraft significantly outperformed the four lookup
systems for all three evaluation metrics (p-values < 0.001)
– SpoofGuard also significantly outperformed all lookup systems in terms
of overall accuracy and concocted site detection rates.
• H2: Learning vs. Rule-based Classifier Systems
– AZProtect significantly outperformed all three comparison classification
systems (all p-values < 0.001).
– The SLT-based system’s ability to incorporate a rich set of fraud cues
allowed it to better detect fake websites than existing rule-based
classifier systems.
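The pair-wise t-tests can be illustrated with a sketch of the paired t statistic. The slides do not say which observations were paired, so the sketch assumes two systems scored on the same items. The function name is hypothetical, and turning the statistic into a p-value would additionally need a t-distribution table or library.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples.

    t is the mean of the pairwise differences divided by its standard error.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

A large positive t with the given degrees of freedom indicates that the first system scored consistently higher than the second across the paired observations.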
Comparison of Learning Classifiers
• An important element of the AZProtect system is its
linear composite SVM kernel.
– Compared it with several learning methods applied to related
classification problems, including text, style, and website
categorization
– All algorithms were trained on the same set of 1,000 websites
– H3: SLT-based learning classifier vs. other learning classifiers
• The linear composite SVM kernel significantly outperformed all six
comparison methods in terms of overall accuracy and its spoof detection
rate.
• Also significantly outperformed Naïve Bayes, Winnow, and Neural Net on
concocted websites.
• However, it was outperformed by J48 on the concocted websites.
Comparison of Static and Dynamic Learning Classifiers
• We compared the custom linear composite kernel
against other generic kernel functions.
– The comparison kernels did not incorporate problem-specific
characteristics related to the fake website domain.
• Linear kernel that weighted all attributes in the input feature vectors equally
(Ntoulas et al., 2006)
• Linear kernel that weighted each attribute in the feature vector based on its
information gain score (attained on the training data).
• Additionally, 2nd and 3rd degree polynomial kernels and a radial basis
function kernel were incorporated (Drost and Scheffer, 2005).
– H4: Custom linear kernel vs. other kernels
• Proposed kernel significantly outperformed comparison kernels on 21 out of
25 conditions.
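The generic comparison kernels named above have standard textbook forms, sketched here for reference. The `coef0` and `gamma` parameters and the way attribute weights enter the weighted linear kernel are assumptions; the slides do not give the settings that were used.

```python
import math


def linear_kernel(u, v, weights=None):
    """Inner product of u and v, with optional per-attribute weights.

    With weights=None all attributes count equally; passing information
    gain scores as weights gives a weighted variant (assumed form).
    """
    if weights is None:
        weights = [1.0] * len(u)
    return sum(w * p * q for w, p, q in zip(weights, u, v))


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """(<u, v> + coef0) ** degree; degree 2 or 3 in the comparison."""
    return (linear_kernel(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """exp(-gamma * ||u - v||^2), the radial basis function kernel."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(u, v))
    return math.exp(-gamma * sq_dist)
```

None of these functions uses site structure or page duplication patterns, which is the difference the H4 comparison isolates.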
Conclusions and Future Directions
• Contributions
– Advocated the development of SLT-based fake website detection
systems
• Used experiments to show that SLT-based systems can improve fake
website detection capabilities
• Due to better generalization ability, ability to use rich fraud cues and custom
kernels, and through the use of dynamic learning.
– Proposed an improved SLT-based fake website detection system
• SVM classifier with a composite linear kernel and rich feature set
• Evaluated effectiveness of static and dynamic classifiers
– Compared various state-of-the-art systems for fake site detection
• Applied to concocted and spoof sites
• Future Directions
– Usability study of proposed AZProtect system
• Compare effectiveness of various toolbar layouts (Wu et al., 2006)
– Improve computation time of system
• Currently 2.9 seconds per website
• Other systems take between 0.5 and 2.0 seconds (Chou et al., 2004; Liu et al., 2006)
References
• Abbasi, A. and Chen, H. “Detecting Fake Escrow Websites using Rich Fraud Cues and Kernel Based Methods,” Paper Submitted to the Workshop on Information Technologies and Systems, Montreal, Canada, 2007.
• Chou, N., Ledesma, R., Teraguchi, Y., Boneh, D., and Mitchell, J. C. “Client-side Defense Against Web-based Identity Theft,” In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, 2004.
• Chua, C. E. H. and Wareham, J. “Fighting Internet Auction Fraud: An Assessment and Proposal,” IEEE Computer, (37:10), 2004, pp. 31-37.
• Gyöngyi, Z. and Garcia-Molina, H. “Spam: It’s not Just for Inboxes Anymore,” IEEE Computer, (38:10), 2005, pp. 28-34.
• Hariharan, P., Asgharpour, F., and Camp, L. J. “NetTrust – Recommendation System for Embedding Trust in a Virtual Realm,” In Proceedings of the ACM Conference on Recommender Systems, Minneapolis, Minnesota, 2007.
• Levy, E. and Arce, I. “Criminals Become Tech Savvy,” IEEE Security and Privacy, (2:2), 2004, pp. 65-68.
• Li, L. and Helenius, M. “Usability Evaluation of Anti-Phishing Toolbars,” Journal in Computer Virology, (3:2), 2007, pp. 163-184.
• Liu, W., Deng, X., Huang, G., and Fu, A. Y. “An Antiphishing Strategy Based on Visual Similarity Assessment,” IEEE Internet Computing, (10:2), 2006, pp. 58-65.
• MacInnes, I., Damani, M., and Laska, J. “Electronic Commerce Fraud: Towards an Understanding of the Phenomenon,” In Proceedings of the Hawaii International Conference on Systems Sciences (HICSS), 2005.
• Wu, M., Miller, R. C., and Garfinkel, S. L. “Do Security Toolbars Actually Prevent Phishing Attacks?,” In Proceedings of the Conference on Human Factors in Computing Systems, Montreal, Canada, 2006, pp. 601-610.
• Zhang, Y., Egelman, S., Cranor, L., and Hong, J. “Phinding Phish: Evaluating Anti-phishing Tools,” In Proceedings of the 14th Annual Network and Distributed System Security Symposium (NDSS), 2007.