Searching the Searchers with SearchAudit

JOHN P. JOHN, FANG YU, YINGLIAN XIE, MARTÍN ABADI, ARVIND KRISHNAMURTHY
PRESENTATION BY SAM KLOCK

Motivation
- We can find this via a Google search

Motivation (cont’d)
- Search engines open opportunities for attackers
  - Construct clever queries
  - Find vulnerable sites
  - Plant malware; spam (e.g., MyDoom)
  - Do so stealthily and cheaply
- Mitigation strategy: identify malicious queries
  - May be able to deny results to user
  - Identify attackers (probably bots)
  - Interpret strategy, then anticipate and prevent
- The question: how to do so

Proposed Approach
- SearchAudit
  - Framework for generating malicious queries
- Input:
  - Seed set of known malicious queries
  - Search logs
- Output:
  - Large set of suspicious queries
  - Regular expressions matching queries

[Figure: a seed set and search logs feed into SearchAudit, which outputs an expanded set of queries and regular expressions]
Example seed queries:
  inurl:gotoURL.asp?url=
  filetype:asp inurl:"shopdisplayproducts.asp"
  ext:pl inurl:cgi intitle:"FormMail *" -"*Referrer" -"* Denied" -sourceforge -error -cvs -input
  filetype:cgi inurl:tseekdir.cgi
  ...
Example regular expressions:
  "/includes/joomla\.php" site:\.[a-zA-Z]{2,3}
  "/includes/class_item\.php" site:[^?=#+@;&:]{2,4}
  "php-nuke" site:[^?=#+@;&:]{2,4}
  "modules\.php\?op=modload" site:\.[a-zA-Z0-9]{2,6}

Proposed Approach (cont’d)
- Needed to implement:
  - Seed set: milw0rm.com
  - Search logs: Microsoft Research Bing
  - Way to expand seed set into more queries
  - Way to infer regular expressions
- Intended benefits:
  - Harvesting lots of information
    - Three months: ~1.2 TB of logs
  - Interpret relationship between queries and attacks
  - Use queries to find potential victims
  - Stop attacks

SearchAudit
- Query identification
- Query analysis

Query Identification: Expansion
- Basic idea: bootstrap on seed set
  - Search logs for exact matches to seed queries
  - Record IPs of hosts making seed queries
  - Add other queries from those IPs to set
- Intuition: make one malicious query, will probably make more
- Account for DHCP
[Figure labels: Seed queries; Log search; IP addresses; Queries made on same day; Queries made by IPs]

Query Identification: Regular Expressions
- Goals:
  - Account for variation in queries
  - Take advantage of scripting
- See paper for generation algorithm
- Compute score for generated expressions
  - Lower score: more specific
  - Goal: discard overly general expressions (score > 0.6)
- Consolidate to avoid overlap
- Avoid proxies, public NAT for performance
- Loopback for more queries

Query Identification: Results
- Data from Bing and milw0rm
  - 500 queries
  - Logs for Feb. 2009, Dec. 2009, Jan. 2010
  - ~2 billion views per month
- System implemented on Dryad/DryadLINQ
- Initial observations:
  - Using specificity scores < 0.6 seems to be effective
    - Based on cookie heuristic
  - Proxy elimination does not limit results

Query Identification: Results (cont’d)
- Query expansion:
  - 122 of 500 queries matched in logs: 174 unique IPs
  - Expanded to 800 unique queries, 264 IPs
  - Regular expressions matched 3,560 queries, 1,001 IPs
- Incomplete seeds
  - Tried with subsets of original set
  - Coverage still good

Query Identification: Results (cont’d)
- Loopback:
  - Multiple loopbacks got more results
  - One iteration is good enough
- Overall statistics
  - 10,000s IPs each month
  - 100,000s unique queries each month
  - Dec. 09: set of unusual attacker IPs cause spike

Query Identification: Verification
- Want to show queries are malicious
- Sometimes easy: 73% of queries associated with security/hacker sites
- What about others?
  - Individual bots
  - Groups of bots
- No ground truth exists
- So: look for bot-like features
  - Individual level (one IP)
    - New cookie
    - Whether a link was clicked
  - Group level (multiple IPs)
    - Data often fixed by botnets
      - User agent string
      - Metadata for requests
    - Tendencies dictated by scripts
      - Pages viewed per query
      - Time between queries

Query Identification: Verification (cont’d)
- Substantial variation between host behavior for normal queries and suspicious queries

Observations on Stage One
- Regular expressions can become obsolete
  - Just need fresh logs and a new seed to get new ones
- Attacker awareness of technique yields adaptation
  - Example: mix in normal user queries
    - Goal: trick SearchAudit into identifying as proxy
    - Hard to do: needs to be appropriate to time and place
    - Anyway: proxy elimination is optimization only
  - Injecting randomness also possible, but makes querying less productive
  - Could obviate cookie heuristic, but it is replaceable
- All attackers need to be careful to succeed

Query Analysis
- 42,000 IPs gave suspicious queries globally
  - U.S., Russia, China contribute almost 50%
  - 10% of IPs gave 90% of queries
- Found 200 regular expressions
- Reveal three kinds of attack-related queries:
  - Vulnerable web sites
  - Forum spamming
  - Phishing on Windows Live Messenger

Queries for Vulnerable Websites
- Queries look for exploitable server vulnerabilities
  - GET variables embedded in URL (for SQL injection)
  - Server software with known vulnerabilities (e.g., status pages)
- Example:
  inurl:index.php?content=X
  http://www.example.com/index.php?content=X'%20OR%20'1'%20OR%20'1=1'
- SearchAudit as a defense:
  - Pull suspicious queries for vulnerabilities
  - Run queries; gather results
  - Inspect results for vulnerabilities
  - Notify sites of vulnerabilities

Queries for Vulnerable Websites (cont’d)
- With identified queries:
  - Sampled 5,000 queries
  - Obtained 80,490 URLs from 39,475 sites
- Compared to malware/phishing lists:
  - 3-4% on anti-phishing lists
  - 1.5% on anti-malware lists
- SQL injection vulnerability:
  - Add a single-quote to variable in URL
  - Look for SQL error
  - 12% of examined URLs showed an error

Queries for Forum Spamming
- Query motivation:
  - Find scriptable forums
  - Good for spam, PageRank
- Found 46 applicable regular expressions
- Most IPs show transient behavior: probably bots
- All regular expression groups show at least one group similarity feature
- IPs got less aggressive over time: more stealthy

Queries for Forum Spamming (cont’d)
- Validation
  - Project Honey Pot
    - Dynamically generate email address for each visiting IP
    - E-mail received: must be spam
  - 12% of all IPs listed (vs. 0.5% for normal IPs)
- Applications
  - Use queries to find and clean targeted pages
  - Deny results to malicious queries

Phishing via Windows Live Messenger
- Queries triggered by normal users
  - Victim receives message from a contact
  - Follow link for party photos
  - Taken to fake WLM login
  - After giving credentials, redirected to Bing search for “party”
- Bing search to avoid costs of hosting

Phishing via WLM (cont’d)
- Detect via query referral field (source page)
- Found two regular expressions for referrals
  - Both expressions: victim username embedded in URL
- Over 180 phishing domains for 12 IPs detected
- Compromised accounts show different login behaviors

Conclusion
- Presented framework for finding suspicious queries
  - Input: search logs, small set of seed queries
  - Output: regular expressions, millions of suspicious queries
- Analyzed suspicious queries
  - Identified possible attacks
  - Suggested means of prevention
- Generally: attempted to demonstrate relationship between suspicious queries and the possibility of attack
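
Illustration: query expansion and regular-expression matching

The query-identification stage (exact seed matches, then other queries from the same IPs on the same day, then regular expressions to catch variations) can be sketched roughly as follows. This is a minimal illustration, not the authors' Dryad/DryadLINQ implementation. The log format (day, IP, query), the sample records, and the function names expand_seed_queries and ips_matching are all assumptions made for the example. The regular expression is one of those shown on the slides; how such expressions are generated and scored is left to the paper.

```python
import re

# Hypothetical log records: (day, ip, query). Real search logs carry far
# more fields (cookies, user agent, referrer, clicks).
LOG = [
    ("2009-02-01", "10.0.0.1", 'inurl:gotoURL.asp?url='),
    ("2009-02-01", "10.0.0.1", '"/includes/joomla.php" site:.de'),
    # Same IP on a different day: excluded, since DHCP may have
    # reassigned the address to another host.
    ("2009-02-02", "10.0.0.1", 'weather seattle'),
    ("2009-02-01", "10.0.0.2", 'cheap flights'),
    ("2009-02-01", "10.0.0.3", '"/includes/joomla.php" site:.fr'),
]

SEEDS = ['inurl:gotoURL.asp?url=']

# One of the example expressions from the slides.
PATTERN = r'"/includes/joomla\.php" site:\.[a-zA-Z]{2,3}'


def expand_seed_queries(log_records, seed_queries):
    """Return (expanded query set, suspicious IP set).

    Step 1: find (ip, day) pairs that issued an exact seed query.
    Step 2: add every query those IPs issued on the same day.
    """
    records = list(log_records)
    seeds = set(seed_queries)
    hosts = {(ip, day) for day, ip, query in records if query in seeds}
    expanded = {query for day, ip, query in records if (ip, day) in hosts}
    return expanded, {ip for ip, _ in hosts}


def ips_matching(log_records, pattern):
    """Return IPs that issued any query fully matching the expression."""
    compiled = re.compile(pattern)
    return {ip for _, ip, query in log_records if compiled.fullmatch(query)}


if __name__ == "__main__":
    expanded, ips = expand_seed_queries(LOG, SEEDS)
    print(sorted(expanded))
    print(sorted(ips))
    print(sorted(ips_matching(LOG, PATTERN)))
```

In this toy log, expansion only reaches the host that issued a seed query, while the regular expression also picks up 10.0.0.3, which never issued a seed query but sent a scripted variation of an expanded one. That is the role the slides assign to regular expressions: accounting for variation in queries.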