JYXAK'10_Slides

Searching the Searchers with SearchAudit
JOHN P. JOHN
FANG YU
YINGLIAN XIE
MARTÍN ABADI
ARVIND KRISHNAMURTHY
PRESENTATION BY SAM KLOCK
Motivation
We can find this via a Google search
Motivation (cont’d)
 Search engines open opportunities for attackers
 Construct clever queries
 Find vulnerable sites
 Plant malware; spam (e.g., MyDoom)
 Do so stealthily and cheaply
 Mitigation strategy: identify malicious queries
 May be able to deny results to user
 Identify attackers (probably bots)
 Interpret strategy, then anticipate and prevent
 The question: how to do so
Proposed Approach
 SearchAudit
Framework for generating malicious queries
 Input:
Seed set of known malicious queries
Search logs
 Output:
Large set of suspicious queries
Regular expressions matching queries
[Figure: SearchAudit takes a seed set and search logs as input and produces an expanded set of queries and regular expressions.]
Seed set (examples):
inurl:gotoURL.asp?url=
filetype:asp inurl:"shopdisplayproducts.asp"
ext:pl inurl:cgi intitle:"FormMail *" -"*Referrer" -"* Denied" -sourceforge -error -cvs -input
filetype:cgi inurl:tseekdir.cgi
...
Regular expressions (examples):
"/includes/joomla\.php" site:\.[a-zA-Z]{2,3}
"/includes/class_item\.php" site:[^?=#+@;&:]{2,4}
"php-nuke" site:[^?=#+@;&:]{2,4}
"modules\.php\?op=modload" site:\.[a-zA-Z0-9]{2,6}
Proposed Approach (cont’d)
 Needed to implement:
 Seed set: milw0rm.com
 Search logs: Bing (via Microsoft Research)
 Way to expand seed set into more queries
 Way to infer regular expressions
 Intended benefits:
 Harvesting lots of information
Three months: ~1.2 TB of logs
Interpret relationship between queries and attacks
Use queries to find potential victims
Stop attacks
SearchAudit
Query identification
Query analysis
Query Identification: Expansion
 Basic idea: bootstrap on seed set
Search logs for exact matches to seed queries
Record IPs of hosts making seed queries
Add other queries from those IPs to set
Intuition: a host that makes one malicious query will probably make more
 Account for DHCP: only add queries made on the same day
[Figure: seed queries → log search → IP addresses → queries made by those IPs on the same day]
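The expansion step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation (which ran on Dryad/DryadLINQ): the `LogEntry` layout, the `expand_seed` name, and the sample log rows are all assumptions made for the example. Treating an (IP, day) pair as one host is one simple way to account for DHCP.

```python
from typing import Iterable, List, NamedTuple, Set


class LogEntry(NamedTuple):
    """One search-log record; this layout is assumed for illustration."""
    day: str    # calendar day of the query, e.g. "2009-02-01"
    ip: str     # client IP address
    query: str  # raw query string


def expand_seed(seed_queries: Iterable[str], log: List[LogEntry]) -> Set[str]:
    """Return every query issued by an IP on a day it also issued a seed query."""
    seeds = set(seed_queries)
    # Step 1: exact-match the seed queries and record (IP, day) of the senders.
    suspicious_hosts = {(e.ip, e.day) for e in log if e.query in seeds}
    # Step 2: add all other queries made by those IPs on the same day.
    return {e.query for e in log if (e.ip, e.day) in suspicious_hosts}


# Made-up log rows, for illustration only.
log = [
    LogEntry("2009-02-01", "10.0.0.1", "filetype:cgi inurl:tseekdir.cgi"),
    LogEntry("2009-02-01", "10.0.0.1", "ext:pl inurl:cgi"),
    LogEntry("2009-02-02", "10.0.0.1", "weather seattle"),  # same IP, other day
    LogEntry("2009-02-01", "10.0.0.2", "cheap flights"),    # unrelated IP
]
expanded = expand_seed(["filetype:cgi inurl:tseekdir.cgi"], log)
```

Here `expanded` contains the seed query and `ext:pl inurl:cgi`, but not the query the same IP made on a different day.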
Query Identification: Regular Expressions
 Goals:
 Account for variation in queries
 Take advantage of scripting
 See paper for generation algorithm
 Compute score for generated expressions
Lower score: more specific
Goal: discard overly general expressions (score > 0.6)
 Consolidate to avoid overlap
 Avoid proxies, public NAT for performance
 Loopback: feed matched queries back in as seeds to find more queries
Query Identification: Results
 Data from Bing and milw0rm
 500 seed queries
 Logs for Feb. 2009, Dec. 2009, Jan. 2010
~2 billion views per month
 System implemented on Dryad/DryadLINQ
 Initial observations:
 Using specificity scores < 0.6 seems to be effective
Based on cookie heuristic
Proxy elimination does not limit results
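How the score threshold is applied can be sketched as follows. The slides defer the scoring and generation algorithm to the paper, so the scores below are made-up placeholders and the function names are hypothetical; only the 0.6 cutoff and the first pattern (one of the example expressions shown earlier) come from the slides.

```python
import re
from typing import Dict, Iterable, List

SCORE_THRESHOLD = 0.6  # from the slides: lower score means more specific


def keep_specific(scored: Dict[str, float],
                  threshold: float = SCORE_THRESHOLD) -> List["re.Pattern[str]"]:
    """Compile only the expressions whose score is below the threshold.

    How a score exactly equal to the threshold is treated is not stated in
    the slides; dropping it here is an assumption.
    """
    return [re.compile(p) for p, score in scored.items() if score < threshold]


def match_queries(patterns: List["re.Pattern[str]"],
                  queries: Iterable[str]) -> List[str]:
    """Return the queries matched by at least one retained expression."""
    return [q for q in queries if any(p.search(q) for p in patterns)]


# Scores are invented for illustration; they are not results from the paper.
scored = {
    r'"/includes/joomla\.php" site:\.[a-zA-Z]{2,3}': 0.2,
    r'site:': 0.9,  # overly general, so it is discarded
}
queries = ['"/includes/joomla.php" site:.de', 'site:example.com reviews']
kept = match_queries(keep_specific(scored), queries)
```

With these inputs only the Joomla query survives, because the general `site:` expression is filtered out before matching.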
Query Identification: Results (cont’d)
 Query expansion:
 122 of 500 seed queries matched in logs: 174 unique IPs
 Expanded to 800 unique queries, 264 IPs
 Regular expressions matched 3,560 queries, 1,001 IPs
 Incomplete seeds
 Tried with subsets of original set
 Coverage still good
Query Identification: Results (cont’d)
 Loopback:
Multiple loopbacks got more results
One iteration is good enough
 Overall statistics
Tens of thousands of IPs each month
Hundreds of thousands of unique queries each month
Dec. 2009: a set of unusual attacker IPs causes a spike
Query Identification: Verification
 Want to show queries are malicious
Sometimes easy: 73% of queries associated with security/hacker sites
What about others?
 No ground truth exists
 So: look for bot-like features
Individual level (one IP)
Group level (multiple IPs)
 Individual bots
New cookie
Whether a link was clicked
 Groups of bots
Data often fixed by botnets: user agent string, metadata for requests
Tendencies dictated by scripts: pages viewed per query, time between queries
Query Identification: Verification (cont’d)
Substantial variation between host behavior for
normal queries and suspicious queries
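The individual-level features named on the verification slide (new cookie, link clicked) can be computed per IP roughly as below. This is a sketch under assumptions: the `Request` record layout and the `per_ip_features` name are hypothetical, and the slides give no thresholds, so none are applied. Presumably a scripted client would show a high new-cookie fraction and a low click fraction compared with normal users.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Request(NamedTuple):
    """One logged search request; this layout is assumed for illustration."""
    ip: str
    new_cookie: bool  # the request arrived without an existing cookie
    clicked: bool     # the user clicked at least one result link


def per_ip_features(requests: Iterable[Request]) -> Dict[str, Tuple[float, float]]:
    """Map each IP to (fraction of new-cookie requests, fraction with a click)."""
    counts = defaultdict(lambda: [0, 0, 0])  # total, new-cookie, clicked
    for r in requests:
        c = counts[r.ip]
        c[0] += 1
        c[1] += int(r.new_cookie)
        c[2] += int(r.clicked)
    return {ip: (nc / n, ck / n) for ip, (n, nc, ck) in counts.items()}


# Made-up requests, for illustration only.
requests = [
    Request("10.0.0.1", True, False),
    Request("10.0.0.1", True, False),
    Request("10.0.0.1", True, False),
    Request("10.0.0.1", True, False),
    Request("10.0.0.2", False, True),
    Request("10.0.0.2", False, False),
]
features = per_ip_features(requests)
```

Comparing the distribution of these fractions for suspicious IPs against normal IPs is the kind of contrast the slide's figure summarizes.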
Observations on Stage One
 Regular expressions can become obsolete
 Just need fresh logs and a new seed to get new ones
 Attacker awareness of technique yields adaptation
 Example: mix in normal user queries
Goal: trick SearchAudit into identifying the attacker as a proxy
 Hard to do: needs to be appropriate to time and place
 Anyway: proxy elimination is optimization only
Injecting randomness also possible, but makes querying less productive
Could obviate cookie heuristic, but it is replaceable
 All attackers need to be careful to succeed
Query Analysis
 42,000 IPs gave suspicious queries globally
 U.S., Russia, China contribute almost 50%
 10% of IPs gave 90% of queries
 Found 200 regular expressions
 Reveal three kinds of attack-related queries:
 Vulnerable web sites
 Forum spamming
 Phishing on Windows Live Messenger
Queries for Vulnerable Websites
 Queries look for exploitable server vulnerabilities
GET variables embedded in URL (for SQL injection)
Server software with known vulnerabilities (e.g., status pages)
 Example query: inurl:index.php?content=X
 Example attack URL: http://www.example.com/index.php?content=X'%20OR%20'1'%20OR%20'1=1'
 SearchAudit as a defense:
 Pull suspicious queries for vulnerabilities
 Run queries; gather results
 Inspect results for vulnerabilities
 Notify sites of vulnerabilities
Queries for Vulnerable Websites (cont’d)
 With identified queries:
 Sampled 5,000 queries
 Obtained 80,490 URLs from 39,475 sites
 Compared to malware/phishing lists:
3-4% on anti-phishing lists
1.5% on anti-malware lists
 SQL injection vulnerability:
 Add a single quote to a variable in the URL
 Look for SQL error
 12% of examined URLs showed an error
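The single-quote check can be sketched without any network access: one helper builds the probe URLs and another inspects a response body that the caller has already fetched. The function names and the list of error markers are assumptions for illustration; the slides do not say which error strings the authors looked for.

```python
from typing import List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative substrings only; real database error messages vary widely.
SQL_ERROR_MARKERS = ("sql syntax", "odbc", "unclosed quotation mark")


def quoted_variants(url: str) -> List[str]:
    """Return one URL per GET variable, with a single quote appended to it."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    variants = []
    for i, (name, value) in enumerate(params):
        probed = params[:i] + [(name, value + "'")] + params[i + 1:]
        # urlencode percent-encodes the quote as %27.
        variants.append(urlunsplit(parts._replace(query=urlencode(probed))))
    return variants


def looks_like_sql_error(body: str) -> bool:
    """Heuristically decide whether a response body contains an SQL error."""
    lowered = body.lower()
    return any(marker in lowered for marker in SQL_ERROR_MARKERS)


probes = quoted_variants("http://www.example.com/index.php?content=X")
```

For the example URL this yields a single probe, `http://www.example.com/index.php?content=X%27`. Fetching the probe and passing the body to `looks_like_sql_error` is left to the caller.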
Queries for Forum Spamming
 Query motivation:
 Find scriptable forums
 Good for spam, PageRank
 Found 46 applicable regular expressions
 Most IPs show transient behavior: probably bots
All regular expression groups show at least one group similarity feature
 IPs got less aggressive over time: more stealthy
Queries for Forum Spamming (cont’d)
 Validation
 Project Honey Pot
Dynamically generates an email address for each visiting IP
E-mail received at that address must be spam
12% of all IPs listed (vs. 0.5% for normal IPs)
 Applications
 Use queries to find and clean targeted pages
 Deny results to malicious queries
Phishing via Windows Live Messenger
 Queries triggered by normal users
Victim receives message from a contact
Follow link for party photos
Taken to fake WLM login
After giving credentials, redirected to Bing search for “party”
 Bing search to avoid costs of hosting
Phishing via WLM (cont’d)
 Detect via query referral field (source page)
Found two regular expressions for referrals
Both expressions: victim username embedded in URL
 Over 180 phishing domains for 12 IPs detected
 Compromised accounts show different login behaviors
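Detection through the referral field can be illustrated as follows. The slides do not show the two actual regular expressions, so `REFERRER_PATTERN` below is a purely hypothetical stand-in that only mirrors the stated property (the victim's username is embedded in the referring URL); the sample referrers are invented.

```python
import re
from typing import Iterable, Set, Tuple

# HYPOTHETICAL pattern: not one of the two expressions found by the authors.
REFERRER_PATTERN = re.compile(
    r"^https?://(?P<domain>[^/]+)/\?user=(?P<username>[^&]+)$"
)


def phishing_referrals(referrers: Iterable[str]) -> Set[Tuple[str, str]]:
    """Return (domain, username) pairs for referrers matching the pattern."""
    found = set()
    for ref in referrers:
        m = REFERRER_PATTERN.match(ref)
        if m:
            found.add((m.group("domain"), m.group("username")))
    return found


# Made-up referrer values, for illustration only.
referrers = [
    "http://photos-party.example/?user=alice",
    "http://www.example.com/news",
    "http://photos-party.example/?user=bob",
]
hits = phishing_referrals(referrers)
```

Grouping the resulting pairs by domain gives the set of phishing domains, and grouping by username gives the accounts that may have been compromised.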
Conclusion
 Presented framework for finding suspicious queries
 Input: search logs, small set of seed queries
 Output: regular expressions, millions of suspicious queries
 Analyzed suspicious queries
 Identified possible attacks
 Suggested means of prevention
 Generally: attempted to demonstrate relationship between suspicious queries and the possibility of attack