18th ACM Conference on Computer and Communications Security (CCS 2011)
Cloak and Dagger: Dynamics of Web Search Cloaking
David Y. Wang, Stefan Savage, and Geoffrey M. Voelker
University of California, San Diego
Presented by 左昌國, Seminar @ ADLab, NCU-CSIE

Outline
• Introduction
• Methodology
• Results
• Related Work
• Conclusion

Introduction
• Search Engine Optimization (SEO)
  • "Search engine optimization (SEO) is the process of improving the visibility of a website or a web page in search engines via the 'natural' or un-paid ('organic' or 'algorithmic') search results." --Wikipedia
  • SEO can be practiced with entirely benign techniques
• Cloaking
  • Dates back to at least 1999
  • One of the notorious blackhat SEO techniques
  • Delivers different content to different user segments, e.g., to search engine crawlers vs. normal users

Introduction
[Figure: a cloaked site returns different pages to a normal user and to a search engine crawler]

Introduction
• Types of cloaking
  • Repeat Cloaking: tracks returning visitors via cookies or IP addresses
  • User-Agent Cloaking: keys on the User-Agent field in the HTTP request header
  • Referrer Cloaking: keys on the Referer field in the HTTP request header
  • IP Cloaking: keys on the client's IP address

Introduction
• This paper:
  • Designs a system, Dagger, to identify cloaking in near real-time
  • Uses the system to:
    • Provide a picture of cloaking activity as seen through three search engines (Google, Bing, and Yahoo)
    • Characterize the differences in cloaking behavior between undifferentiated "trending" keywords and targeted keywords
    • Characterize the dynamic behavior of cloaking activity

Methodology
• Dagger consists of five functional components:
  • Collecting search terms
  • Fetching search results from the search engines
  • Crawling the pages linked from the search results
  • Analyzing the crawled pages
  • Repeating the measurements over time

Methodology
• Collecting Search Terms
  • Popular search terms are collected from Google Hot Searches, Alexa, and Twitter
  • A second source of search terms is constructed from keyword suggestions returned by Google Suggest
    • e.g., entering "viagra 50mg" yields suggestions such as "viagra 50mg cost", "viagra 50mg canada", …

Methodology
• Querying Search Results
  • The search terms are submitted to three search engines (Google, Yahoo, and Bing)
  • Google Hot Searches and Alexa each supply 80 terms per 4-hour measurement period and Twitter supplies 40; together with 240 additional suggestions derived from the Google Hot Searches terms (80 × 3), this yields 440 terms in total
  • The top 100 search results are extracted for each search term (44,000 results)
  • Whitelisted URLs are removed
  • Grouping similar entries (same URL, source, and search term) leaves roughly 15,000 unique URLs per measurement period on average

Methodology
• Crawling Search Results
  • The web crawler is written in Java using the HttpClient 3.x package from Apache
  • Each URL is crawled three times (see the sketch below):
    • Disguised as a normal user running Internet Explorer, clicking through from the search result
    • Disguised as the Googlebot web crawler
    • Disguised as a normal user again, NOT clicking through the search result
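The paper's crawler is built on Apache HttpClient 3.x; purely as an illustration of the three per-persona fetches, here is a minimal, self-contained sketch using only the JDK's HttpURLConnection. The target URL, User-Agent strings, and Google referrer below are illustrative assumptions, not values taken from the paper.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PersonaFetcher {
    // Illustrative User-Agent strings for the two personas.
    static final String UA_IE =
            "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)";
    static final String UA_GOOGLEBOT =
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

    // Fetch a URL while impersonating a given client; a non-null Referer
    // simulates clicking through from a search results page.
    static String fetch(String url, String userAgent, String referer)
            throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        if (referer != null) {
            conn.setRequestProperty("Referer", referer);
        }
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        String url = "http://example.com/";  // hypothetical search result
        // Crawl 1: normal IE user arriving from a Google results page.
        String user = fetch(url, UA_IE, "http://www.google.com/search?q=example");
        // Crawl 2: the Googlebot crawler.
        String bot = fetch(url, UA_GOOGLEBOT, null);
        // Crawl 3: normal IE user again, without clicking through.
        String direct = fetch(url, UA_IE, null);
        System.out.println("User-Agent cloaking suspected: " + !user.equals(bot));
        System.out.println("Referrer cloaking suspected:   " + !user.equals(direct));
    }
}
```

In practice the fetched pages can differ for benign reasons (timestamps, rotating ads), which is why Dagger compares them with similarity measures rather than exact equality, as described next.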
Methodology
• Crawling Search Results (cont.)
  • Dealing with IP cloaking: a fourth crawl is performed through Google Translate, so that the request arrives from Google's IP address space
  • More than half of the cloaked results perform IP cloaking

Methodology
• Detecting Cloaking
  • HTTP error responses are removed (4% of URLs on average)
  • Text shingling filters out nearly identical pages
    • 90% of URLs are near-duplicates ("near-duplicate" means the two sets of signatures differ by 10% or less)
  • The similarity between the search-result snippet and the user view of the page is then measured (see the scoring sketch at the end of this deck)
    • Noise is removed from both the snippet and the body of the user view
    • Each substring of the snippet is searched for in the user view
    • Score: the number of words in unmatched substrings divided by the total number of words across all substrings
      • 1.0 means no match; 0.0 means a full match
    • Threshold: 0.33, which filters out 56% of the remaining URLs

Methodology
• Detecting Cloaking (cont.)
  • False positives may still exist, so comparing the DOMs serves as the final test
  • The score is the sum of an overall comparison and a hierarchical comparison
    • Overall comparison: unmatched tags from the entire page divided by the total number of tags
    • Hierarchical comparison: the sum over DOM levels of the unmatched tags at each level, divided by the total number of tags
    • 2.0 means no match; 0.0 means a full match
  • Threshold: 0.66

Methodology
• Detecting Cloaking (cont.)
  • Manual inspection of the flagged URLs:
    • False positives: 9.1% (29 of 317) for Google search and 12% (9 of 75) for Yahoo
    • These are benign websites that nonetheless deliver different content to search engines, e.g., via advanced browser detection
• Temporal Remeasurement
  • Dagger remeasures every 4 hours for up to 7 days

Results [figure slides; only the topic headings are recoverable]
• Cloaking Over Time
• Sources of Search Terms
• Search Engine Response
• Cloaking Duration
• Cloaked Content
• Domain Infrastructure
• SEO

Conclusion
• Cloaking is a standard technique for constructing scam pages
• This paper examined the current state of search-engine cloaking as used to support Web spam
• It introduced new techniques for identifying cloaking, via the search-engine snippets that capture the keyword-related content found at the time of crawling
• It explored the dynamics of cloaked search results and sites over time
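As referenced under Detecting Cloaking above, here is a minimal sketch of the snippet-similarity score. It assumes snippets use "..." as excerpt separators and it omits the noise-removal step the paper applies before matching; the sample strings in main are illustrative only.

```java
public class SnippetScore {
    // Fraction of snippet words that belong to substrings not found
    // verbatim in the user view of the page.
    // 0.0 = every substring matched; 1.0 = no substring matched.
    static double score(String snippet, String userView) {
        // Search-result snippets separate page excerpts with "..." markers.
        String[] substrings = snippet.split("\\.\\.\\.");
        int totalWords = 0, unmatchedWords = 0;
        for (String s : substrings) {
            String sub = s.trim();
            if (sub.isEmpty()) continue;
            int words = sub.split("\\s+").length;
            totalWords += words;
            if (!userView.contains(sub)) {
                unmatchedWords += words;
            }
        }
        return totalWords == 0 ? 0.0 : (double) unmatchedWords / totalWords;
    }

    public static void main(String[] args) {
        String snippet = "cheap viagra 50mg ... best prices guaranteed";
        String page = "Welcome! We stock cheap viagra 50mg and more.";
        double d = score(snippet, page);  // 3 of 6 words unmatched -> 0.5
        // With the paper's threshold of 0.33, a score above 0.33 means the
        // user view diverges from the snippet, so the URL is kept for the
        // final DOM-comparison test.
        System.out.println(d + (d > 0.33 ? " -> suspect, keep" : " -> filter out"));
    }
}
```

The DOM test that follows in the pipeline is scored analogously, summing the fraction of unmatched tags over the whole page and per hierarchy level.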