Beyond Blacklists: Learning to Detect Malicious Web Sites

Q&A for “Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs”
Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker
KDD 2009
By Fu-Chi Ao
Questions
• What’s the error rate?
• What are the relevant/dominant features out
of the selected 30783 features?
• Indication of TTL values?
• How to construct the feature vectors?
• What are the 3959 WHOIS information features?
2009/9/15
What’s the error rate?
(In binary classification)

                         Real value: Positive   Real value: Negative
Test outcome: Positive   True Positive          False Positive
Test outcome: Negative   False Negative         True Negative

• Accuracy: The proportion of the true results in the population
$$\text{Accuracy} = \frac{\#\text{ of true positives} + \#\text{ of true negatives}}{\#\text{ of total test outcomes}}$$
• Error rate = 1 – Accuracy
$$\text{Error rate} = \frac{\#\text{ of false positives} + \#\text{ of false negatives}}{\#\text{ of total test outcomes}}$$
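A minimal Python sketch of the accuracy and error-rate definitions for a binary classifier. The class name, function names and the counts in the usage lines are made up for illustration; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Cell counts of a 2x2 confusion matrix (hypothetical container)."""
    true_pos: int
    false_pos: int
    false_neg: int
    true_neg: int

    @property
    def total(self) -> int:
        return self.true_pos + self.false_pos + self.false_neg + self.true_neg


def accuracy(c: ConfusionCounts) -> float:
    # (true positives + true negatives) / all test outcomes
    if c.total == 0:
        raise ValueError("no test outcomes")
    return (c.true_pos + c.true_neg) / c.total


def error_rate(c: ConfusionCounts) -> float:
    # (false positives + false negatives) / all test outcomes = 1 - accuracy
    if c.total == 0:
        raise ValueError("no test outcomes")
    return (c.false_pos + c.false_neg) / c.total


# Illustrative counts only: 1000 test URLs, 15 of them misclassified.
counts = ConfusionCounts(true_pos=90, false_pos=5, false_neg=10, true_neg=895)
print(accuracy(counts), error_rate(counts))
```

Note that with heavily imbalanced data the error rate alone can hide a poor false-negative rate, which is why the four cells are kept separately here.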
What are the relevant/dominant features
out of the selected 30783 features?
[Figure: breakdown of the non-zero features into benign and malicious]
• Breakdown of features for L1-regularized LR for an
instance of the Yahoo-PhishTank data set
– The training phase
for L1-regularized LR
yields a sparse
parameter vector w
– Focus on a smaller
number of relevant
features
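To illustrate why L1 regularization gives a sparse w, here is a small sketch in plain Python. The slides do not say which solver the authors used; this sketch assumes proximal gradient descent (ISTA) with a soft-thresholding step as a stand-in, and the toy data set, the function names and the values of lam, lr and steps are all made up for illustration.

```python
import math
from itertools import product


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(v: float, t: float) -> float:
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logreg(X, y, lam=0.1, lr=0.5, steps=500):
    """Minimize mean log-loss + lam * ||w||_1 by proximal gradient descent.

    The bias b is not penalized. Hyperparameters are illustrative assumptions.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        # residual = predicted probability - label, per example
        residuals = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - yi
            for x, yi in zip(X, y)
        ]
        grad_w = [sum(r * x[j] for r, x in zip(residuals, X)) / n for j in range(d)]
        grad_b = sum(residuals) / n
        # gradient step, then soft-thresholding sets small weights exactly to 0
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy data: 3 features in {-1, +1}; only feature 0 determines the label.
X = [list(bits) for bits in product((-1.0, 1.0), repeat=3)]
y = [1.0 if x[0] > 0 else 0.0 for x in X]

w, b = fit_l1_logreg(X, y)
print(w)  # weight 0 is positive; weights 1 and 2 are exactly 0.0
```

The two uninformative features never receive a gradient larger than the threshold, so their weights stay at exactly zero, which is the sparsity effect described on this slide.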
Certain “Red Flags” Indicate Malicious Intent
• 1) Suspicious ownership of the site
– Benign features: IP ranges belonging to Google, Yahoo and AOL
– Malicious features: having an NS record in one of the IP prefixes run by
GoDaddy
• 2) Where the site is hosted geographically
– Top-6 benign features: ‘.gov’, ‘.edu’, ‘.com’, ‘.org’, ‘.ca’ and ‘.se’
– Top-6 malicious features: ‘.info’, ‘.kr’, ‘.it’, ‘.hu’, and ‘.es’
• 3) The registration date of the site
– Malicious: a recent registration or update date, or any of the
three WHOIS dates (registration, update, expiration) missing
• 4) What kind of connection the server is using
– Top-2 benign features: have T1 speed for the DNS A and MX records
– Malicious: sites hosted on compromised machines in residential ISPs
• 5) The presence of certain URL extensions
– e.g. "bankofamerica.com" vs. "bankofamerica.com.cz.rnl"
What are the relevant/dominant features
out of the selected 30783 features? (cont’d)
• Machine learning techniques can adapt to
differing feature distributions by learning the
appropriate decision rules automatically, rather
than having them discovered and adjusted
manually for each data set
– The results of experiments show that different
data sets provide different feature distributions
for distinguishing malicious and benign URLs
What are the relevant/dominant features
out of the selected 30783 features? (cont’d)
• Automation of the classifier
– The classifier selected the malicious and benign
features for which domain experts had prior intuition
– It also automatically selected new, non-obvious
features that were highly predictive and yielded
additional, substantial performance
improvements
Indication of TTL values?
• “What is the time-to-live (TTL) value for the DNS
records associated with the hostname?”
• The TTL is set by an authoritative name server for a
particular resource record
• Low TTL value
– Some well-known large web sites depend on low TTL
values to enable quick changes to their web sites
• e.g. “www.cnn.com”
– Some small web sites require frequent DNS updates
(when their IP address changes)
• e.g. sites run on ADSL or cable connections with dynamic IP addresses
How to construct the feature vectors?
• Use the selected features to encode individual URLs as very
high dimensional feature vectors
– Most are generated by the “bag-of-words” representation of the
URL, registrar name, and registrant name
– Binary features are also used to encode all possible ASes,
prefixes and geographic locales of an IP address
• The resulting URL descriptors typically have tens of
thousands of binary features
• Overfitting
– It is not known in advance which features are relevant
• Only a subset of the generated features may correlate with
malicious Web sites
– When there are more features than labeled examples → prone
to overfitting!
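The “bag-of-words” encoding of a URL can be sketched as follows. This is only an illustration under stated assumptions: the delimiter set, the "host:" / "path:" prefixes, the function names and the second URL (www.example.com) are not taken from the slides.

```python
import re
from typing import Dict, Iterable, List, Set
from urllib.parse import urlsplit


def lexical_tokens(url: str) -> Set[str]:
    """Bag of words for one URL: hostname tokens and path tokens kept apart."""
    parts = urlsplit(url)
    host_tokens = [t for t in (parts.hostname or "").split(".") if t]
    # Assumed delimiters for the path and query: / ? . = - _
    path_tokens = [t for t in re.split(r"[/?.=\-_]", parts.path + "?" + parts.query) if t]
    return {"host:" + t for t in host_tokens} | {"path:" + t for t in path_tokens}


def build_vocabulary(urls: Iterable[str]) -> Dict[str, int]:
    """Give every distinct token seen in the training URLs its own column."""
    vocab: Dict[str, int] = {}
    for url in urls:
        for tok in sorted(lexical_tokens(url)):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(url: str, vocab: Dict[str, int]) -> List[int]:
    """Sparse binary vector: sorted indices of the columns that are 1."""
    return sorted(vocab[t] for t in lexical_tokens(url) if t in vocab)


urls = [
    "http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll",
    "http://www.example.com/index.html",  # hypothetical benign URL
]
vocab = build_vocabulary(urls)
print(len(vocab), encode(urls[1], vocab))
```

Because every new token adds a column, the vocabulary grows with the training set, which is how the descriptors reach tens of thousands of mostly-zero binary features. Storing only the indices of the non-zero entries keeps each vector small.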
Feature vector construction
URL: http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
• WHOIS registration: 3/25/2009
• Hosted from 208.78.240.0/22
• IP hosted in San Mateo
• Connection speed: T1
• Has DNS PTR record? Yes
• Registrant “Chad”
• ...
[Figure: the properties above are encoded as one feature vector, [ _ _ … 0 0 0 1 1 1 … 1 0 1 1 … ], with segments labeled Real-valued, Host-based and Lexical]
No clear illustration for the construction methodology…
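Since the paper's figure leaves the procedure implicit, here is one plausible reading of it as a Python sketch, not the authors' actual method. It encodes the example URL above as a single sparse vector with a real-valued part, a host-based part and a lexical part. All feature names, the choice of "days since WHOIS registration" as the real-valued feature, and the observation date (taken here to be the date of these slides) are assumptions.

```python
from datetime import date
from typing import Dict


def describe_example_url(observed_on: date) -> Dict[str, float]:
    """Named features for the example URL on this slide (names are hypothetical)."""
    features: Dict[str, float] = {}

    # Real-valued part: age of the WHOIS registration (3/25/2009) in days.
    registered = date(2009, 3, 25)
    features["whois_age_days"] = float((observed_on - registered).days)

    # Host-based part: one binary feature per observed property.
    for name in (
        "ip_prefix=208.78.240.0/22",
        "ip_city=San Mateo",
        "conn_speed=T1",
        "has_dns_ptr",
        "registrant_token=chad",
    ):
        features[name] = 1.0

    # Lexical part: one binary feature per token of the URL.
    for token in ("host:www", "host:bfuduuioo1fp", "host:mobi",
                  "path:ws", "path:ebayisapi", "path:dll"):
        features[token] = 1.0
    return features


def to_sparse_vector(features: Dict[str, float], index: Dict[str, int]) -> Dict[int, float]:
    """Map feature names to columns, giving each unseen name the next free column."""
    return {index.setdefault(name, len(index)): value for name, value in features.items()}


feature_index: Dict[str, int] = {}
vector = to_sparse_vector(describe_example_url(date(2009, 9, 15)), feature_index)
print(vector)
```

Only the non-zero columns are stored. Every property that does not apply to this URL (other prefixes, other cities, other tokens) is implicitly 0, which matches the long runs of zeros in the figure.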
What are the 3959 WHOIS information features?
• WHOIS is a distributed database that contains contact
and registration information for a domain
– the owner and registrar of the domain (including
home page URL)
– date of registration, last update, expiration
– primary and secondary DNS servers
– and any additional status information of the domain
• The features are mainly tokens in the names of the
registrar and registrant of the domain name