JYXAK'10_Slides

Searching the Searchers with SearchAudit
John P. John, Fang Yu, Yinglian Xie, Martín Abadi, Arvind Krishnamurthy
Presentation by Sam Klock

Motivation
• We can find this via a Google search [figure not preserved]

Motivation (cont’d)
• Search engines open opportunities for attackers
  - Construct clever queries
  - Find vulnerable sites
  - Plant malware; spam (e.g., MyDoom)
  - Do so stealthily and cheaply
• Mitigation strategy: identify malicious queries
  - May be able to deny results to user
  - Identify attackers (probably bots)
  - Interpret strategy, then anticipate and prevent
• The question: how to do so

Proposed Approach
• SearchAudit
  - Framework for generating malicious queries
  - Input:
    - Seed set of known malicious queries
    - Search logs
  - Output:
    - Large set of suspicious queries
    - Regular expressions matching queries

[Diagram: Seed set + Search logs -> SearchAudit -> Expanded set + Regular expressions]

Seed set (examples):
  inurl:gotoURL.asp?url=
  filetype:asp inurl:"shopdisplayproducts.asp"
  ext:pl inurl:cgi intitle:"FormMail *" -"*Referrer" -"* Denied" -sourceforge -error -cvs -input
  filetype:cgi inurl:tseekdir.cgi
  ...

Regular expressions (examples):
  "/includes/joomla\.php" site:\.[a-zA-Z]{2,3}
  "/includes/class_item\.php" site:[^?=#+@;&:]{2,4}
  "php-nuke" site:[^?=#+@;&:]{2,4}
  "modules\.php\?op=modload" site:\.[a-zA-Z0-9]{2,6}

Proposed Approach (cont’d)
• Needed to implement:
  - Seed set: milw0rm.com
  - Search logs: Microsoft Research
    - Bing
  - Way to expand seed set into more queries
  - Way to infer regular expressions
• Intended benefits:
  - Harvesting lots of information
    - Three months: ~1.2 TB of logs
  - Interpret relationship between queries and attacks
  - Use queries to find potential victims
  - Stop attacks

SearchAudit
• Query identification
• Query analysis

Query Identification: Expansion
• Basic idea: bootstrap on seed set
  - Search logs for exact matches to seed queries
  - Record IPs of hosts making seed queries
  - Add other queries from those IPs to set
• Intuition: make one malicious query, will probably make more
• Account for DHCP

[Diagram labels: Seed queries; Log search; IP addresses; Queries made on same day; Queries made by IPs]

Query Identification: Regular Expressions
• Goals:
  - Account for variation in queries
  - Take advantage of scripting
• See paper for generation algorithm
• Compute score for generated expressions
  - Lower score: more specific
  - Goal: discard overly general expressions (score > 0.6)
• Consolidate to avoid overlap
• Avoid proxies, public NAT for performance
• Loopback for more queries

Query Identification: Results
• Data from Bing and milw0rm
  - 500 queries
  - Logs for Feb. 2009, Dec. 2009, Jan. 2010
  - ~2 billion views per month
• System implemented on Dryad/DryadLINQ
• Initial observations:
  - Using specificity scores < 0.6 seems to be effective
    - Based on cookie heuristic
  - Proxy elimination does not limit results

Query Identification: Results (cont’d)
• Query expansion:
  - 122 of 500 queries matched in logs: 174 unique IPs
  - Expanded to 800 unique queries, 264 IPs
  - Regular expressions matched 3,560 queries, 1,001 IPs
• Incomplete seeds
  - Tried with subsets of original set
  - Coverage still good

Query Identification: Results (cont’d)
• Loopback:
  - Multiple loopbacks got more results
  - One iteration is good enough
• Overall statistics
  - 10,000s IPs each month
  - 100,000s unique queries each month
• Dec. 09: set of unusual attacker IPs cause spike

Query Identification: Verification
• Want to show queries are malicious
  - Sometimes easy: 73% of queries associated with security/hacker sites
  - What about others?
• No ground truth exists
• So: look for bot-like features
  - Individual level (one IP): individual bots
  - Group level (multiple IPs): groups of bots
• Features listed on the slide:
  - Data often fixed by botnets
  - User agent string
  - Metadata for requests
  - New cookie
  - Whether a link was clicked
  - Tendencies dictated by scripts
  - Pages viewed per query
  - Time between queries

Query Identification: Verification (cont’d)
• Substantial variation between host behavior for normal queries and suspicious queries

Observations on Stage One
• Regular expressions can become obsolete
  - Just need fresh logs and a new seed to get new ones
• Attacker awareness of technique yields adaptation
  - Example: mix in normal user queries
    - Goal: trick SearchAudit into identifying as proxy
    - Hard to do: needs to be appropriate to time and place
    - Anyway: proxy elimination is optimization only
  - Injecting randomness also possible, but makes querying less productive
  - Could obviate cookie heuristic, but it is replaceable
• All attackers need to be careful to succeed

Query Analysis
• 42,000 IPs gave suspicious queries globally
  - U.S., Russia, China contribute almost 50%
  - 10% of IPs gave 90% of queries
• Found 200 regular expressions
• Reveal three kinds of attack-related queries:
  - Vulnerable web sites
  - Forum spamming
  - Phishing on Windows Live Messenger

Queries for Vulnerable Websites
• Queries look for exploitable server vulnerabilities
  - GET variables embedded in URL (for SQL injection)
  - Server software with known vulnerabilities (e.g., status pages)
• Example:
  - Query: inurl:index.php?content=X
  - URL: http://www.example.com/index.php?content=X'%20OR%20'1'%20OR%20'1=1'
• SearchAudit as a defense:
  - Pull suspicious queries for vulnerabilities
  - Run queries; gather results
  - Inspect results for vulnerabilities
  - Notify sites of vulnerabilities

Queries for Vulnerable Websites (cont’d)
• With identified queries:
  - Sampled 5,000 queries
  - Obtained 80,490 URLs from 39,475 sites
• Compared to malware/phishing lists:
  - 3-4% on anti-phishing lists
  - 1.5% on anti-malware lists
• SQL injection vulnerability:
  - Add a single-quote to variable in URL
  - Look for SQL error
  - 12% of examined URLs showed an error

Queries for Forum Spamming
• Query motivation:
  - Find scriptable forums
  - Good for spam, PageRank
• Found 46 applicable regular expressions
• Most IPs show transient behavior: probably bots
• All regular expression groups show at least one group similarity feature
• IPs got less aggressive over time: more stealthy

Queries for Forum Spamming (cont’d)
• Validation
  - Project Honey Pot
    - Dynamically generate email address for each visiting IP
    - E-mail received: must be spam
  - 12% of all IPs listed (vs. 0.5% for normal IPs)
• Applications
  - Use queries to find and clean targeted pages
  - Deny results to malicious queries

Phishing via Windows Live Messenger
• Queries triggered by normal users
• Victim receives message from a contact
  - Follow link for party photos
  - Taken to fake WLM login
  - After giving credentials, redirected to Bing search for “party”
• Bing search to avoid costs of hosting

Phishing via WLM (cont’d)
• Detect via query referral field (source page)
  - Found two regular expressions for referrals
  - Both expressions: victim username embedded in URL
• Over 180 phishing domains for 12 IPs detected
• Compromised accounts show different login behaviors

Conclusion
• Presented framework for finding suspicious queries
  - Input: search logs, small set of seed queries
  - Output: regular expressions, millions of suspicious queries
• Analyzed suspicious queries
  - Identified possible attacks
  - Suggested means of prevention
• Generally: attempted to demonstrate relationship between suspicious queries and the possibility of attack
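
Sketch: query expansion

The expansion step from the "Query Identification: Expansion" slide (find exact seed matches in the logs, record the IPs, add the other queries from those IPs) can be illustrated with a minimal Python sketch. This is not the authors' Dryad/DryadLINQ implementation. The log record layout `(day, ip, query)`, the function name `expand_seed_queries`, and the sample data are all assumptions made for illustration. The "Account for DHCP" bullet is interpreted here as restricting expansion to queries issued by the same IP on the same day as a seed match, which is one plausible reading of the slide.

```python
def expand_seed_queries(log_records, seed_queries):
    """Expand a seed set of known-malicious queries using search logs.

    log_records: a list of (day, ip, query) tuples (hypothetical layout).
    seed_queries: an iterable of known-malicious query strings.
    Returns (expanded_queries, suspect_keys), where suspect_keys is the
    set of (ip, day) pairs that issued at least one seed query.
    """
    seeds = set(seed_queries)

    # Steps 1 and 2: find exact matches to seed queries and record who
    # issued them. Keying on (ip, day) rather than ip alone is an
    # assumption about how DHCP address churn is handled.
    suspect_keys = {
        (ip, day) for day, ip, query in log_records if query in seeds
    }

    # Step 3: add every other query issued by those IPs on those days.
    expanded = {
        query for day, ip, query in log_records if (ip, day) in suspect_keys
    }
    return expanded, suspect_keys


# Usage with made-up log records (queries borrowed from the seed examples).
log = [
    ("2009-02-01", "10.0.0.1", "inurl:gotoURL.asp?url="),
    ("2009-02-01", "10.0.0.1", "filetype:cgi inurl:tseekdir.cgi"),
    ("2009-02-02", "10.0.0.1", "weather seattle"),
    ("2009-02-01", "10.0.0.2", "party photos"),
]
expanded, keys = expand_seed_queries(log, ["inurl:gotoURL.asp?url="])
```

In this example the second query from 10.0.0.1 is pulled into the expanded set because it shares an IP and day with a seed match, while the next-day query from the same address and the query from 10.0.0.2 are left out. The slides describe further steps (regular expression generation, proxy elimination, loopback) that this sketch does not attempt.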
