Securing Web Service by Automatic Robot Detection
KyoungSoo Park, Vivek S. Pai (Princeton University)
Kang-Won Lee, Seraphin Calo (IBM T.J. Watson Research Center)
KyoungSoo Park USENIX 2006

Web Robots
• Automatic agents
  • Web crawlers
  • URL link checkers
• Malicious robots are widespread
  • Password cracking
  • Referrer/Blog spamming
  • Click frauds on Google search
  • Burning CPU with heavy CGI queries

Contributions
• Real-time robot detector
• Fast detection
  • 80% at 20 reqs, 95% at 57 reqs
• High accuracy
  • 2.4% max false positive rate
• Low overhead
  • ~200 usec additional delay per page
• Easy deployment

Operational Scenario
• Server-side
  • Site Webserver
  • Many-to-one
• Client-side
  • Firewall/Proxies at LAN
  • Many-to-many
[Diagram labels: Clients, Client infrastructure, MON, Server infrastructure, Servers]

Design Goals
• Transparency
  • No human intervention
• Accuracy
  • Minimal false positives
• Real-time proof
  • Periodic check should be possible
  • Authentication or CAPTCHA not enough
• Practicality

Observation & Intuition
Robot behavior
• Custom program
• Goal-oriented
  • No embedded objs
  • No index file
  • Follow hidden links
  • No HW events
Human behavior
• Standard browsers
• Browsing purpose
  • Cascading style sheets
  • Images
  • Never follow hidden links
  • Mouse & keyboard
Humans are easier to detect

Browser Detection
• "No standard browser" implies robot
• "User-Agent" HTTP header?
• Use behavioral artifacts (dynamic mods)
  • Redundant embedded objects
  • Empty cascading style sheet (CSS)
  • Invisible images (1x1 JPEG) or mute sounds
  • Hidden links

Human Activity Detection
• Human activities imply human
• Mouse/keyboard event tracking
  • Most robots don't generate HW events
• Dynamically embed JavaScript code
  • MouseMove triggers the event handler
  • Event handler fetches a fake image
  • Semantically & lexically obfuscated

Test with CoDeeN
• CoDeeN (http://codeen.cs.princeton.edu/)
  • Pulling-based CDN on PlanetLab over 3 years
  • 25+ million reqs from 50K clients/day
  • Malicious robots seeking abuse
• Results for 1-week measurement
  • But changes now permanent

Main Result
• CSS Fetch: 28.9%
• JavaScript Exec: 27.1%
• MouseMove: 22.3%
• Robots: 71.1%
[Figure labels: "JS but No MouseMove", "Robots", "Not sure, but human", "Potential FP, 1.9%"]

Main Result
• Max False Positive Rate = FP / negatives = 1.9 / 77.7 = 2.4%
• CSS Fetch: 28.9%
• Only 9% passed (optional) CAPTCHA
• Robots: 71.1%
• Only 0.9% followed hidden links

How Fast Can We Detect?
[Figure: CDF. Y-axis: fraction of sessions detected in X reqs]
[Figure: X-axis: number of requests required to detect. Series: CSS file, JavaScript files, mouse events]
• 80% at 20 reqs
• 95% at 57 reqs

# of CoDeeN Abuse Complaints
[Figure: Y-axis: # of CoDeeN complaints (0 to 10). X-axis: months in 2005/2006 (Jan 2005 to May 2006). Marked on the timeline: Browser Detection, Human Activity Detection]

Limitations
• Defeating browser detection
  • Behave exactly like a standard browser
• Human activity detection
  • Robots generating mouse/key events
  • Disable JavaScript: 4%
• Solution
  • Ensemble techniques

Machine Learning (AdaBoost)
[Figure: Y-axis: accuracy (%), 88 to 96. X-axis: # of requests, 20 to 160. Series: train set, test set]
Three most effective attributes
1. RESPONSE CODE 300%
2. REFERRER %
3. UNSEEN REFERRER %
Drawbacks
1. Heavy computation/memory
2. Pattern may change
3. Human intervention

Conclusions
• Practical robot detection tool
• Detect human by
  • Standard browser behavior
  • Human activities
• "Arms Race" in the end
  • Turing test
• Most simple bots screened out
• Ensemble techniques promising
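
Illustration (not from the talk)
The two detection ideas in the slides, browser detection through injected objects and human activity detection through a mouse-event beacon, can be sketched as a per-session tracker. This is only a minimal sketch under assumptions: the URL prefixes, the `Session` class, the 20-request cutoff used as a decision threshold, and the rule that a hidden-link fetch overrides other evidence are all hypothetical choices made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    HUMAN = "human"
    ROBOT = "robot"
    UNDECIDED = "undecided"


# Hypothetical URL prefixes for objects injected into each served page.
CSS_BEACON = "/__rd/style/"    # empty CSS file; standard browsers fetch it
MOUSE_BEACON = "/__rd/mouse/"  # fake image fetched by the MouseMove handler
HIDDEN_LINK = "/__rd/hidden/"  # link a human never sees, so never follows


@dataclass
class Session:
    """Evidence collected for one client session (assumed data model)."""
    requests: int = 0
    fetched_css: bool = False
    mouse_event: bool = False
    followed_hidden_link: bool = False

    def observe(self, path: str) -> None:
        """Record one request path and update the evidence flags."""
        self.requests += 1
        if path.startswith(CSS_BEACON):
            self.fetched_css = True
        elif path.startswith(MOUSE_BEACON):
            self.mouse_event = True
        elif path.startswith(HIDDEN_LINK):
            self.followed_hidden_link = True

    def verdict(self, min_requests: int = 20) -> Verdict:
        """Classify the session; min_requests is an assumed cutoff."""
        if self.followed_hidden_link:
            return Verdict.ROBOT          # only robots follow hidden links
        if self.mouse_event:
            return Verdict.HUMAN          # hardware event observed
        if self.requests < min_requests:
            return Verdict.UNDECIDED      # too early to decide
        # Enough traffic seen: no CSS fetch means no standard browser.
        return Verdict.UNDECIDED if self.fetched_css else Verdict.ROBOT


def max_false_positive_rate(fp_pct: float, negatives_pct: float) -> float:
    """FP / negatives, as a percentage, as on the Main Result slide."""
    return 100.0 * fp_pct / negatives_pct
```

For example, `round(max_false_positive_rate(1.9, 77.7), 1)` reproduces the 2.4% bound from the Main Result slide. A session that fetches the CSS file but never triggers the mouse beacon stays undecided here, which mirrors the "Not sure" region of the slides rather than forcing a verdict.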