Securing Web Service by Automatic Robot Detection, KyoungSoo Park, Vivek S. Pai

Securing Web Service by
Automatic Robot Detection
KyoungSoo Park, Vivek S. Pai
Princeton University
Kang-Won Lee, Seraphin Calo
IBM T.J. Watson Research Center
Web Robots
• Automatic agents
• Web crawlers
• URL link checkers
• Malicious robots are widespread
• Password cracking
• Referrer/Blog spamming
• Click frauds on Google search
• Burning CPU with heavy CGI queries
KyoungSoo Park
USENIX 2006
Contributions
• Real-time robot detector
• Fast detection
• 80% at 20 reqs, 95% at 57 reqs
• High accuracy
• 2.4% max false positive rate
• Low overhead
• ~200 usec additional delay per page
• Easy deployment
Operational Scenario
• Server-side
• Site Webserver
• Many-to-one
• Client-side
• Firewall/Proxies at LAN
• Many-to-many
[Figure: deployment diagram with the labels Clients, MON, Servers, Server infrastructure, and Client infrastructure]
Design Goals
• Transparency
• No human intervention
• Accuracy
• Minimal false positives
• Real-time proof
• Periodic check should be possible
• Authentication or CAPTCHA not enough
• Practicality
Observation & Intuition
Robot behavior
• Custom program
• Goal-oriented
• No embedded objs
• No index file
• Follow hidden links
• No HW events
Human behavior
• Standard browsers
• Browsing purpose
• Cascading style sheets
• Images
• Never follow hidden links
• Mouse & keyboard
Humans are easier to detect
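The contrast between robot and human behavior can be read as a per-session decision rule. The sketch below is illustrative only: the `Session` fields, the `classify` function, and the `min_requests` threshold of 20 are hypothetical choices, not details given in the talk.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Signals observed so far for one client session (hypothetical fields)."""
    requests: int = 0
    fetched_css: bool = False
    followed_hidden_link: bool = False
    saw_mouse_or_key_event: bool = False


def classify(s: Session, min_requests: int = 20) -> str:
    """Return 'human', 'robot', or 'unknown' for a session."""
    # Hardware events are treated as positive proof of a human.
    if s.saw_mouse_or_key_event:
        return "human"
    # Humans never see hidden links, so following one marks a robot.
    if s.followed_hidden_link:
        return "robot"
    # A standard browser fetches embedded objects such as the style sheet.
    # After enough requests without one, assume a custom program.
    if s.requests >= min_requests and not s.fetched_css:
        return "robot"
    return "unknown"
```

Checking for human evidence first reflects the slide's point that humans are the easier class to confirm. Everything without such evidence stays "unknown" until a robot signal appears.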
Browser Detection
• “No standard browser” implies robot
• “User-Agent” HTTP header?
• Use behavioral artifacts (dynamic mods)
• Redundant embedded objects
• Empty cascading style sheet (CSS)
• Invisible images (1x1 JPEG) or mute sounds
• Hidden links
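One way to picture the dynamic modifications listed above is a filter that rewrites each HTML response before it reaches the client. This is a minimal sketch under stated assumptions: the `/__rd` URL prefix, the function names, and the use of `display:none` to hide the link are all hypothetical, and the talk does not say how its own rewriting is implemented.

```python
import secrets


def make_beacons(prefix: str = "/__rd") -> dict:
    """Create random per-session URLs for the embedded artifacts."""
    key = secrets.token_hex(8)  # unpredictable, so a robot cannot guess the URLs
    return {
        "css": f"{prefix}/{key}.css",    # empty style sheet
        "img": f"{prefix}/{key}.jpg",    # invisible 1x1 image
        "link": f"{prefix}/{key}.html",  # hidden link that no human should follow
    }


def instrument(html: str, beacons: dict) -> str:
    """Insert an empty style sheet, a 1x1 image, and a hidden link."""
    head = f'<link rel="stylesheet" href="{beacons["css"]}">'
    body = (
        f'<img src="{beacons["img"]}" width="1" height="1" alt="">'
        f'<a href="{beacons["link"]}" style="display:none"></a>'
    )
    # Insert just before the closing tags. The match is case-sensitive,
    # and pages without these tags are returned unchanged.
    html = html.replace("</head>", head + "</head>", 1)
    html = html.replace("</body>", body + "</body>", 1)
    return html
```

A server using this sketch would record which beacon URLs a session later requests: a standard browser fetches the style sheet and image on its own, while a request for the hidden link suggests a robot.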
Human Activity Detection
• Human activities imply human
• Mouse/keyboard event tracking
• Most robots don’t generate HW events
• Dynamically embed JavaScript code
• MouseMove triggers the event handler
• Event handler fetches a fake image
• Semantically & lexically obfuscated
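The mouse-event mechanism can be sketched as a server-side helper that emits a small script for each page. The names here (`mouse_beacon_script`, the `/__rd` prefix) are hypothetical, and only the handler name and beacon URL are randomized. The talk's scheme is described as both semantically and lexically obfuscated, which this sketch does not attempt to reproduce.

```python
import secrets


def mouse_beacon_script(prefix: str = "/__rd") -> tuple:
    """Return (script_html, beacon_url) for one page view."""
    key = secrets.token_hex(8)        # random key, so the URL cannot be precomputed
    url = f"{prefix}/{key}.jpg"       # fake image fetched only after a mouse move
    fn = "f" + secrets.token_hex(4)   # randomized handler name
    script = (
        "<script>"
        f"var {fn}=function(){{"
        # Fire once: remove the handler, then fetch the fake image.
        f"document.removeEventListener('mousemove',{fn});"
        f"new Image().src='{url}';"
        "};"
        f"document.addEventListener('mousemove',{fn});"
        "</script>"
    )
    return script, url
```

The server would remember `beacon_url` for the session and treat a later request for it as evidence of a human, since a client that never moves the mouse never triggers the fetch.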
Test with CoDeeN
• CoDeeN (http://codeen.cs.princeton.edu/)
• Pull-based CDN, running on PlanetLab for over 3 years
• 25+ million reqs from 50K clients/day
• Malicious robots seeking abuse
• Results for 1-week measurement
• But changes now permanent
Main Result
[Figure: breakdown of sessions by detection signal]
• CSS Fetch: 28.9%
• JavaScript Exec: 27.1%
• MouseMove: 22.3%
• Robots: 71.1%
• JS but No MouseMove: robots
• Not sure, but human: potential FP, 1.9%
Main Result
Max False Positive Rate = FP / negatives = potential FP / robots = 1.9 / 77.7 = 2.4%
• CSS Fetch: 28.9%
• Robots: 71.1%
• Only 9% passed (optional) CAPTCHA
• Only 0.9% followed hidden links
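The false positive bound is simple arithmetic, reproduced below as a check. One step is an assumption rather than something the slide states: the 77.7% denominator is taken to be every session without a MouseMove event (100% minus 22.3%). The function name is hypothetical.

```python
def max_false_positive_rate(potential_fp_pct: float, robots_pct: float) -> float:
    """Upper bound on the false positive rate, as a percentage."""
    return 100.0 * potential_fp_pct / robots_pct


mouse_move_pct = 22.3                # sessions with a MouseMove event (from the slide)
robots_pct = 100.0 - mouse_move_pct  # assumption: all other sessions count as robots
rate = max_false_positive_rate(1.9, robots_pct)
print(f"{robots_pct:.1f}% robots, max FP rate {rate:.1f}%")
```

The bound is a maximum because it treats every "not sure, but human" session as a misclassified human.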
How Fast Can We Detect?
[Figure: CDF of the fraction of sessions detected in X reqs, with curves for CSS file, JavaScript files, and Mouse events. x-axis: Number of Requests Required to Detect. y-axis: fraction of sessions, 0 to 1]
• 80% at 20 reqs
• 95% at 57 reqs
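The detection-speed figures are read off a cumulative distribution. The sketch below shows how such a point could be computed from per-session data. The function name is hypothetical, and the sample list is made up for illustration; it is not the CoDeeN measurement.

```python
import math


def requests_needed(detect_at: list, fraction: float) -> int:
    """Smallest X such that `fraction` of sessions are detected within X requests."""
    ordered = sorted(detect_at)
    # Index of the session that completes the requested fraction.
    k = math.ceil(fraction * len(ordered)) - 1
    return ordered[max(k, 0)]


# Made-up sample: the request number at which each session was classified.
sample = [2, 4, 4, 6, 9, 11, 14, 19, 30, 45]
print(requests_needed(sample, 0.5))
print(requests_needed(sample, 0.95))
```

Applied to real session logs, the values for fractions 0.80 and 0.95 would correspond to the two points called out on the slide.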
# of CoDeeN Abuse Complaints
[Figure: number of CoDeeN abuse complaints per month (y-axis 0 to 10) from January 2005 through May 2006, annotated with Browser Detection and Human Activity Detection. x-axis: Months in 2005/2006]
Limitations
• Defeating browser detection
• Behave exactly like a standard browser
• Human activity detection
• Robots generating mouse/key events
• Disable JavaScript – 4%
• Solution
• Ensemble techniques
Machine Learning (AdaBoost)
[Figure: accuracy (%) on the train set and test set versus # of requests. x-axis: 20 to 160. y-axis: 88 to 96]
Three most effective attributes
1. RESPONSE CODE 3XX %
2. REFERRER %
3. UNSEEN REFERRER %
Drawbacks:
1. Heavy computation/memory
2. Pattern may change
3. Human intervention
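The three attributes are per-session percentages that a classifier such as AdaBoost would take as input. The sketch below computes them from a request log under loudly stated assumptions: the first attribute is read as the share of 3xx responses, an "unseen" referrer is taken to mean a referrer URL the session never requested itself, and the `Request` type and all field names are hypothetical. The talk does not give these definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """One logged request (hypothetical record layout)."""
    url: str
    status: int
    referrer: Optional[str] = None


def session_features(reqs: List[Request]) -> dict:
    """Compute three per-session percentages over the requests seen so far."""
    n = len(reqs)
    if n == 0:
        return {"resp_3xx_pct": 0.0, "referrer_pct": 0.0, "unseen_referrer_pct": 0.0}
    seen = set()  # URLs this session has already requested
    redirects = with_ref = unseen_ref = 0
    for r in reqs:
        if 300 <= r.status < 400:
            redirects += 1
        if r.referrer:
            with_ref += 1
            # Assumed definition: a referrer the session never fetched itself.
            if r.referrer not in seen:
                unseen_ref += 1
        seen.add(r.url)
    return {
        "resp_3xx_pct": 100.0 * redirects / n,
        "referrer_pct": 100.0 * with_ref / n,
        "unseen_referrer_pct": 100.0 * unseen_ref / n,
    }
```

Recomputing these values as each request arrives would give one feature vector per request count, which matches the figure's x-axis of accuracy versus number of requests.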
Conclusions
• Practical robot detection tool
• Detect human by
• Standard browser behavior
• Human activities
• “Arms Race” in the end
• Turing test
• Most simple bots screened out
• Ensemble techniques promising