18th ACM Conference on Computer and Communications Security
(CCS 2011)
Cloak and Dagger: Dynamics of Web
Search Cloaking
David Y. Wang, Stefan Savage, and Geoffrey
M. Voelker
University of California, San Diego
左昌國
Seminar @ ADLab, NCU-CSIE
Outline
• Introduction
• Methodology
• Results
• Related Work
• Conclusion
Introduction
• Search Engine Optimization (SEO)
• “Search engine optimization (SEO) is the process of improving the
visibility of a website or a web page in search engines via the
"natural" or un-paid ("organic" or "algorithmic") search results.” --Wikipedia
• SEO can also be used as a benign technique
• Cloaking
• Dates back to at least 1999
• One of the notorious blackhat SEO techniques
• Delivering different content to different user segments
• e.g., search engine crawlers vs. normal users
Introduction
[Figure: the same URL delivers different pages to a normal user and to a search engine crawler]
Introduction
• Types of cloaking
• Repeat Cloaking
• Cookies or IP tracking
• User Agent Cloaking
• User-Agent field in the HTTP request header
• Referrer Cloaking
• Referer field in the HTTP header
• IP Cloaking
• Serving different content based on the client's IP address (e.g., known crawler IP ranges)
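The cloaking types above can be combined server-side. A minimal sketch of how a cloaker might branch on the User-Agent and Referer headers (the crawler signatures and page labels are illustrative assumptions, not from the paper):

```python
# Hypothetical cloaking decision logic; signatures and page labels are
# illustrative assumptions.  "Slurp" is Yahoo's crawler name.
CRAWLER_SIGNATURES = ("googlebot", "bingbot", "slurp")

def choose_content(user_agent: str, referer: str) -> str:
    """Return a different page depending on who appears to be asking."""
    ua = user_agent.lower()
    # User-Agent cloaking: serve keyword-stuffed SEO content to crawlers
    if any(sig in ua for sig in CRAWLER_SIGNATURES):
        return "seo-page"
    # Referrer cloaking: only visitors who clicked through a search
    # result (Referer names a search engine) see the scam page
    if any(se in referer for se in ("google.", "bing.", "yahoo.")):
        return "scam-page"
    # Direct visitors (e.g., a security analyst) see a benign page
    return "benign-page"
```

Repeat cloaking would add a cookie or server-side IP log so the same visitor never sees the scam page twice; IP cloaking would check the client address against known crawler ranges instead of the User-Agent string.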
Introduction
• This paper…
• Designs a system, Dagger, to identify cloaking in near real-time
• Uses this system to
• Provide a picture of cloaking activity as seen through three search
engines (Google, Bing, and Yahoo)
• Characterize the differences in cloaking behavior between
undifferentiated “trending” keywords and targeted keywords.
• Characterize the dynamic behavior of cloaking activity
Methodology
• Dagger consists of five functional components
• Collecting search terms
• Fetching search results from search engines
• Crawling the pages linked from the search results
• Analyzing the pages crawled
• Repeating measurements over time
Methodology
• Collecting Search Terms
• Collecting popular search terms from
• Google Hot Searches
• Alexa
• Twitter
• Constructing another source of search terms using keyword
suggestions from “Google Suggest.”
• e.g., user enters "viagra 50mg" → suggestions: "viagra 50mg cost",
"viagra 50mg canada", …
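The suggestion expansion can be sketched as follows; `suggest` is a hypothetical stand-in for an autocomplete service such as Google Suggest, with canned illustrative data rather than real API output:

```python
def suggest(term):
    """Stand-in for an autocomplete service (e.g., Google Suggest).
    The canned suggestions below are illustrative assumptions."""
    canned = {"viagra 50mg": ["viagra 50mg cost", "viagra 50mg canada",
                              "viagra 50mg price", "viagra 50mg dosage"]}
    return canned.get(term, [])

def expand(base_terms, per_term=3):
    """Each base term contributes up to `per_term` suggested variants
    (80 Google Hot Searches terms * 3 = 240 extra terms per period)."""
    extra = []
    for term in base_terms:
        extra.extend(suggest(term)[:per_term])
    return list(base_terms) + extra
```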
Methodology
• Querying Search Results
• Submitting the search terms to search engines (Google, Yahoo, and
Bing)
• Google Hot Searches and Alexa each supply 80 terms per 4-hour period
• Twitter supplies 40
• Together with 240 additional suggestions based on Google Hot
Searches (80 * 3) → 440 terms in total
• Extracting the top 100 search results for each search term (44,000 results)
• Removing whitelisted URLs
• Grouping similar entries (same URL, source, and search term) →
roughly 15,000 unique URLs on average per measurement period
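The per-period bookkeeping above can be checked with a small sketch; the `SearchResult` triple is an assumed representation of a result entry, not a structure from the paper:

```python
from collections import namedtuple

# One search-result entry: where it was seen and which query produced it
SearchResult = namedtuple("SearchResult", "url source term")

# Per 4-hour measurement period:
# Hot Searches + Alexa + Twitter + suggestion-derived terms (80 * 3)
terms = 80 + 80 + 40 + 80 * 3       # 440 search terms
results = terms * 100               # top 100 results each -> 44,000

def dedupe(entries):
    """Group entries sharing the same (URL, source, search term) triple;
    namedtuples hash by field values, so a set does the grouping."""
    return set(entries)
```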
Methodology
• Crawling Search Results
• Web crawler
• A Java web crawler using the HttpClient 3.x package from Apache
• Crawling each URL 3 times
• Disguised as a normal user running Internet Explorer, clicking through the
search result
• Disguised as the Googlebot Web crawler
• Disguised as a normal user again, NOT clicking through the search
result
• Dealing with IP cloaking
• A fourth crawl via Google Translate, so the request originates from
Google's IP space
• More than half of cloaked results employ IP cloaking
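The four fetches per URL can be sketched as a header schedule. The paper's crawler was written in Java using HttpClient 3.x; the User-Agent strings and the Google Translate URL format below are illustrative assumptions:

```python
# Illustrative User-Agent strings (assumptions, not the paper's values)
IE_UA = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"
BOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def crawl_plan(url, search_result_url):
    """Return (label, target URL, headers) for each of the four fetches."""
    return [
        # 1. normal user clicking through the search result (Referer set)
        ("user", url, {"User-Agent": IE_UA, "Referer": search_result_url}),
        # 2. the Googlebot Web crawler
        ("crawler", url, {"User-Agent": BOT_UA}),
        # 3. normal user arriving directly, no search referrer
        ("direct", url, {"User-Agent": IE_UA}),
        # 4. via Google Translate, so the request leaves from Google's IP
        #    space and sidesteps IP cloaking (URL format is an assumption)
        ("google_ip", "https://translate.google.com/translate?u=" + url,
         {"User-Agent": IE_UA}),
    ]
```

Each entry can then be fetched with any HTTP client; comparing the four responses is what exposes the cloaking.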
Methodology
• Detecting Cloaking
• Removing HTTP error responses (on average 4% of URLs)
• Using text shingling to filter out nearly identical pages
• 90% of URLs are near duplicates ("near duplicates" = the two pages'
shingle signature sets differ by 10% or less)
• Measuring the similarity between the snippet of the search result
and the user view of the page
• Removing noise from both the snippet and the body of the user view
• Searching for each substring of the snippet in the user view
• Score: number of words in unmatched substrings divided by the total
number of words across all substrings
• 1.0 means no match
• 0.0 means full match
• Threshold: 0.33 → filters out 56% of the remaining URLs
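The snippet-to-page similarity score can be sketched as below. Splitting the snippet into substrings at the ellipses that search engines insert between excerpt fragments is an assumption about the noise-removal step:

```python
import re

def snippet_score(snippet: str, body: str) -> float:
    """Fraction of snippet words belonging to substrings NOT found in
    the user view: 1.0 = no match, 0.0 = full match.  Splitting on
    '...' / '…' is an illustrative reading of the substring step."""
    body = body.lower()
    substrings = [s.strip()
                  for s in re.split(r"\.\.\.|…", snippet.lower())
                  if s.strip()]
    total = matched = 0
    for sub in substrings:
        words = len(sub.split())
        total += words
        if sub in body:          # substring present in the user view?
            matched += words
    return 1.0 if total == 0 else (total - matched) / total
```

A page whose score exceeds the 0.33 threshold is kept as a cloaking candidate for the DOM comparison.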
Methodology
• Detecting Cloaking(cont.)
• False positives may still exist
• Examining the DOMs as the final test
• Computing the sum of an overall comparison and a hierarchical
comparison
• Overall comparison: unmatched tags from the entire page divided by the
total number of tags
• Hierarchical comparison: the sum of the unmatched tags from each
level of the DOM hierarchy divided by the total number of tags
• 2.0 means no match
• 0.0 means full match
• Threshold: 0.66
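The DOM comparison can be sketched with Python's `html.parser`; treating "unmatched tags" as multiset differences, overall and per nesting level, is an assumed reading of the comparison:

```python
from collections import Counter
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record each opening tag together with its nesting depth."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []                      # (depth, tag) pairs
    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag))
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1

def _unmatched(a: Counter, b: Counter) -> int:
    """Tags present in one multiset but not the other."""
    return sum((a - b).values()) + sum((b - a).values())

def dom_score(html_a: str, html_b: str) -> float:
    """Overall + per-level tag difference: 0.0 = identical, 2.0 = disjoint."""
    pa, pb = TagCollector(), TagCollector()
    pa.feed(html_a)
    pb.feed(html_b)
    total = len(pa.tags) + len(pb.tags)
    if total == 0:
        return 0.0
    # Overall comparison: unmatched tags over the entire page
    overall = _unmatched(Counter(t for _, t in pa.tags),
                         Counter(t for _, t in pb.tags)) / total
    # Hierarchical comparison: unmatched tags at each DOM level
    levels = {d for d, _ in pa.tags} | {d for d, _ in pb.tags}
    per_level = sum(
        _unmatched(Counter(t for d, t in pa.tags if d == lvl),
                   Counter(t for d, t in pb.tags if d == lvl))
        for lvl in levels) / total
    return overall + per_level
```

A crawler view and user view whose summed score exceeds the 0.66 threshold would be flagged as cloaked.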
Methodology
• Detecting Cloaking(cont.)
• Manual inspection
• False positives: 9.1% (29 of 317) in Google search; 12% (9 of 75) in
Yahoo
• (benign websites that nonetheless deliver different content to search engines)
• Advanced browser detection
• Temporal Remeasurement
• Dagger remeasures every 4 hours for up to 7 days
Results
• Cloaking Over Time
• Sources of Search Terms
• Search Engine Response
• Cloaking Duration
• Cloaked Content
• Domain Infrastructure
• SEO
[These slides consisted of figures from the paper; only the topic titles are recoverable]
Conclusion
• Cloaking is a standard technique for constructing scam pages.
• This paper examined the current state of search engine
cloaking as used to support Web spam.
• New technique for identifying cloaking (via the search engine
snippets, which capture keyword-related content found at the time of
crawling)
• Explored the dynamics of cloaked search results and sites over time.