Q&A for “Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs”
Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker
KDD 2009
By Fu-Chi Ao
2009/9/15

Questions
• What’s the error rate?
• What are the relevant/dominant features out of the selected 30783 features?
• Indication of TTL values?
• How to construct the feature vectors?
• What are the 3959 WHOIS information features?

What’s the error rate? (In binary classification)

                          Real Value: Positive    Real Value: Negative
Test Outcome: Positive    True Positive           False Positive
Test Outcome: Negative    False Negative          True Negative

• Accuracy: the proportion of true results (true positives and true negatives) among all test outcomes
  Accuracy = (# of true positives + # of true negatives) / (# of total test outcomes)
• Error rate = 1 - Accuracy
  Error rate = (# of false positives + # of false negatives) / (# of total test outcomes)

What are the relevant/dominant features out of the selected 30783 features?
[Figure: non-zero features, split into benign and malicious]
• Breakdown of features for L1-regularized LR for an instance of the Yahoo-PhishTank data set
  – The training phase for L1-regularized LR yields a sparse parameter vector w
  – This lets the analysis focus on a smaller number of relevant features

Certain “Red Flags” Indicate Malicious Intent
• 1) Suspicious ownership of the site
  – Benign features: IP ranges belonging to Google, Yahoo and AOL
  – Malicious features: having an NS record in one of the IP prefixes run by GoDaddy
• 2) Where the site is hosted geographically
  – Top-6 benign features: ‘.gov’, ‘.edu’, ‘.com’, ‘.org’, ‘.ca’ and ‘.se’
  – Among the top-6 malicious features: ‘.info’, ‘.kr’, ‘.it’, ‘.hu’ and ‘.es’
• 3) The registration date of the site
  – Malicious: a recent registration or update date, or missing any of the three WHOIS dates (registration, update, expiration)
• 4) What kind of connection the server is using
  – Top-2 benign features: T1 speed for the DNS A and MX records
  – Malicious sites are often hosted on compromised machines in residential ISPs
• 5) The presence of certain URL extensions
  – "bankofamerica.com" vs. "bankofamerica.com.cz.rnl"

What are the relevant/dominant features out of the selected 30783 features? (cont’d)
• Machine learning techniques can adapt to differing feature distributions by learning the appropriate decision rules automatically
  – The experimental results show that different data sets provide different feature distributions for distinguishing malicious from benign URLs
  – This avoids manually discovering and adjusting the decision rules for each data set

What are the relevant/dominant features out of the selected 30783 features? (cont’d)
• Automation of the classifier
  – Selected malicious and benign features for which domain experts had prior intuition
  – Automatically selected new, non-obvious features that were highly predictive and yielded additional, substantial performance improvements

Indication of TTL values?
• “What is the time-to-live (TTL) value for the DNS records associated with the hostname?”
• The TTL is set by an authoritative name server for a particular resource record
• Low TTL value
  – Some well-known larger web sites depend on low TTL values to enable quick changes to their web sites
    • e.g. “www.cnn.com”
  – Some small web sites require frequent DNS updates (when their IP address changes)
    • e.g. sites run on ADSL or cable connections with dynamic IP addresses

How to construct the feature vectors?
• Use the selected features to encode individual URLs as very high-dimensional feature vectors
  – Most features are generated by the “bag-of-words” representation of the URL, registrar name, and registrant name
  – Binary features are also used to encode all possible ASes, prefixes and geographic locales of an IP address
• The resulting URL descriptors typically have tens of thousands of binary features
• Overfitting
  – It is not known in advance which features are relevant
    • Only a subset of the generated features may correlate with malicious Web sites
  – When there are more features than labeled examples, the classifier is prone to overfitting

Feature vector construction
[Figure: an example URL and its host-based properties mapped to a feature vector with real-valued, host-based and lexical components]
• URL: http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
• WHOIS registration: 3/25/2009
• Hosted from 208.78.240.0/22
• IP hosted in San Mateo
• Connection speed: T1
• Has DNS PTR record? Yes
• Registrant “Chad”
• ...
• No clear illustration of the construction methodology…

What are the 3959 WHOIS information features?
• WHOIS is a distributed database that contains contact information for a domain:
  – the owner and registrar of the domain (including home page URL)
  – dates of registration, last update, and expiration
  – primary and secondary DNS servers
  – any additional status information of the domain
• The features are mainly tokens in the names of the registrar and registrant of the domain name
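
The “bag-of-words” encoding of the URL, registrar name and registrant name described in the feature-vector slides can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the tokenizer (splitting on every non-alphanumeric character), the helper names `tokenize`, `extract_features`, `build_vocabulary` and `to_vector`, the feature-name prefixes, and the registrar string "Example Registrar Inc" are all hypothetical. Only the URL and the registrant “Chad” come from the slide.

```python
import re


def tokenize(text):
    """Split text on runs of non-alphanumeric characters and lowercase the tokens.

    Assumption: this delimiter choice is for illustration only.
    """
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


def extract_features(url, registrar, registrant):
    """Return the set of binary feature names that are 'on' for one URL.

    Each source gets its own prefix (hypothetical naming) so that the same
    token in the URL and in a WHOIS field maps to different features.
    """
    features = set()
    for tok in tokenize(url):
        features.add("url:" + tok)
    for tok in tokenize(registrar):
        features.add("registrar:" + tok)
    for tok in tokenize(registrant):
        features.add("registrant:" + tok)
    return features


def build_vocabulary(feature_sets):
    """Assign a column index to every feature name seen in the training data."""
    vocab = {}
    for feature_set in feature_sets:
        for name in sorted(feature_set):  # sorted for a deterministic order
            vocab.setdefault(name, len(vocab))
    return vocab


def to_vector(features, vocab):
    """Encode a feature set as a 0/1 list; names not in the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for name in features:
        idx = vocab.get(name)
        if idx is not None:
            vec[idx] = 1
    return vec


# Example with the URL from the slide; the registrar name is made up.
feats = extract_features(
    "http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll",
    "Example Registrar Inc",
    "Chad",
)
vocab = build_vocabulary([feats])
vector = to_vector(feats, vocab)
```

With many URLs, the vocabulary grows to one column per distinct token, which is how descriptors with tens of thousands of mostly zero binary features arise. In practice a sparse representation would be preferable to the dense list used here.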