Strider Search Ranger: Towards an Autonomic Anti-Spam Search Engine

Yi-Min Wang and Ming Ma
Cybersecurity and Systems Management Group
Microsoft Research, Redmond, WA
{ymwang, mingma}@microsoft.com

Abstract

Search spammers use questionable search engine optimization techniques to promote their spam links into top search results. Large-scale spammers target commerce queries that they can monetize and attempt to spam as many top search results of those queries as possible. We model the large-scale search spam problem as that of defending against correlated attacks on search rankings across multiple keywords, and propose an autonomic anti-spam approach based on self-monitoring and self-protection. In this new approach, search engines monitor and correlate their own search results of spammer-targeted keywords to detect large-scale spam attacks that have successfully bypassed their current anti-spam solutions. They then initiate self-protection through targeted patrol of spam-heavy domains, targeted hunting at the sources of successful spam, and strengthening of specific weaknesses in the ranking algorithms. We describe the Strider Search Ranger system, which implements this new approach, and focus on its use to defend against an important class of search spam – the redirection spam – as a demonstration of the general concept. We evaluate the system by testing it against actual search results and show that it can detect useful spam patterns and eliminate a significant amount of spam for all three major search engines.

1. Introduction

Search spammers (or web spammers) are those who use questionable search engine optimization techniques to promote their links into top search positions they do not deserve. Search spam has traditionally been modeled as an information retrieval and relevance ranking problem: the content of each web page and the hyperlinking relationship between pages are analyzed to determine page ranking [1], and the hope is that spam pages will naturally have lower ranks and so appear after all non-spam pages in search results. However, large-scale search spammers have successfully attacked the relevance-based anti-spam solutions by providing both bogus content and bogus links. To provide bogus content, they use crawler-browser cloaking [2,3,4]: they serve a non-spam page designed to gain a high relevance score to the crawlers, but display spam content designed to maximize potential profit to end users. To provide bogus links, they create “link farms” (i.e., large numbers of websites artificially linking to each other), insert hidden links into cloned websites [5], and perform comment spamming by injecting spam links into the comment fields of publicly accessible “forums” (footnote 1) [6].

Footnote 1: For ease of presentation, throughout the paper, we use the term “forums” to include all blogs, bulletin boards, message boards, guest books, web journals, diaries, galleries, archives, etc. that can be abused by web spammers to promote spam URLs.

In this paper, we propose a new approach to combat large-scale search spam. In contrast with the existing information retrieval-centric approach of applying content analysis to all crawler-indexed pages, we model the problem as a security problem and apply traffic analysis to true user-seen pages presented by search results of spammer-targeted commerce queries. By analogy to the physical world, we do not attempt to separate criminals from the innocent majority by lining up billions of people and trying to identify how today’s criminals typically dress themselves. Instead, we start by gathering intelligence on where the bad neighborhoods are, put those who are particularly active in those neighborhoods under surveillance, capture them as they conduct criminal acts in the crime scene, interrogate them to collect more information about their networks, and then hunt down upstream and downstream miscreants who do business with them.

Our approach is “autonomic” [7] in that it uses self-monitoring of search results to detect large-scale spammers who have successfully defeated the current anti-spam solutions, and then fights back by addressing the specific weaknesses in the ranking algorithms and performing targeted spam patrol and hunting for self-protection. Specifically, we focus on patrolling a group of search results that are more likely to have a high density of spam links, and try to detect anomalies of correlations that indicate the presence of large-scale spammers who are occupying a large number of search results. Based on that information, we strengthen the search ranking algorithms to broadly defend against similar attacks in the future, and perform targeted hunting of related spam pages to clean up existing damage.

We have implemented an autonomic anti-spam system called Strider Search Ranger and tested it against an important class of search spam – the redirection spam – as the first demonstration of the new approach. Redirection spam refers to those spam pages that can be characterized by the third-party domains they generate redirection traffic to. It is a particularly challenging problem that is plaguing all three major search engines because it often involves large-scale attacks and it typically uses cloaking techniques to fool content-based analysis. The use of redirection is becoming essential to a big part of the search spam business that consists of traffic-affiliate spammers, who participate directly in merchant websites’ affiliate programs, and syndication-based spammers, who participate in pay-per-click, advertising syndication programs and display ads-portal pages. In the affiliate model, the need for spam pages to redirect to their target merchant sites is clear. In the syndication model, many large-scale spammers have moved to the practice of setting up “throw-away” doorway pages on legitimate websites to avoid exposing their own domains to blacklisting by search engines; for example, free blog-hosting sites such as blogspot.com and free web-hosting sites such as hometown.aol.com are popular among spammers [8]. Since spammers do not own these servers, they typically use client-side scripts to redirect browsers to fetch ads from redirection domains that they own.

The Strider Search Ranger system combats redirection spam by monitoring spammer-targeted keywords and identifying major doorway domains and redirection domains that have successfully spammed those keywords. It then directs machine resources accordingly to hunt for more spam pages that are associated with these successful spammers and very likely have spammed other keywords as well. The system also provides concrete feedback on the weaknesses of the current content- and link-based ranking algorithms to help strengthen them against spam.

The paper is organized as follows. Section 2 describes a few actual redirection-spam examples to set the context. Section 3 presents the overall architecture of the Search Ranger system and describes the self-monitoring and self-protection subsystems. Section 4 evaluates the effectiveness of the system by demonstrating how it can actually improve the search quality of all three major search engines. Section 5 surveys related work and Section 6 concludes the paper.

We note that all spam investigations and experimental evaluations described in this paper are based on data collected between November and early December 2006. Some of the spam pages may no longer be active or may have changed their behavior.

2. Redirection Spam

2.1. Syndication-Based Spammers

A legitimate syndication business is typically composed of three layers: the publishers who operate high-quality websites to attract traffic, the advertisers who pay for their ads to appear on those sites, and the syndicator who provides the infrastructure to connect them. A spam-heavy syndication program typically has more layers in order to insulate legitimate advertisers from the spam pages where their ads appear [8]: the spammers play the role of publishers by creating low-quality doorway pages that send visiting browsers to their ads-serving redirection domains. When a spam ad is clicked, the click-through traffic usually goes to an anonymous aggregator domain, which funnels such traffic from a large number of spam pages to a handful of syndicators, who are responsible for the final redirection to the target websites owned by the advertisers. It is the advertisers who ultimately pay for the click-through traffic and fund the search-spam industry.

We use two examples of successful spam pages to help readers gain a concrete understanding of syndication-based spam. Around November 2006, this spam URL appeared in the top-10 Google search results for “discount chanel handbag”: http://hometown.aol.com/m1stnah/chanelhandbags.html. It was a doorway that redirected to the well-known redirection domain topsearch10.com. If the http://www.shopping.com ad was clicked, the click-through traffic first went to the aggregator domain 66.230.173.28, which redirected to the syndicator domain looksmart.com, which in turn redirected to the advertiser domain shopping.com.

Similarly, this spam URL from a U.S. government domain appeared in the top-10 Yahoo search results for “verizon ringtone”: http://www.usaid.gov/cgibin/goodbye?http://catalog-online.kzn.ru/free/verizon-ringtones/. It took advantage of the Universal Redirector [6] provided by usaid.gov to redirect to the doorway page at http://catalog-online.kzn.ru/free/verizon-ringtones/, which in turn redirected to the well-known redirection domain paysefeed.net. Clicking the http://usa.funmobile.com ad would send the browser through the following redirection chain: the aggregator domain 66.230.182.178, the syndicator domain findwhat.com, and the advertiser domain funmobile.com.

2.2. Traffic-Affiliate Spammers

Some merchant websites provide affiliate programs that pay for traffic drawn to their sites. Many spammers are abusing such programs by creating and promoting doorway pages that redirect to these merchant sites. There is a major difference between traffic-affiliate spam and syndication-based spam: while the latter displays a list of ads and requires an additional ad-click to initiate the redirection chain leading to the advertiser’s website, the former directly brings the browser to the merchant site. As a result, while the final destination for clicking on a syndication-based doorway is often a spammer-operated ads-serving domain, it is the intermediate redirection domains in the traffic-affiliate scenario that are responsible for the spam.

We next describe two examples of traffic-affiliate spam. Around November 2006, this spam URL appeared in the top-10 Live Search results for “cheap ticket”: http://hometown.aol.com/kliktop/Cheap-TICKET.html. Clicking on this doorway link would generate a redirection chain that went through two intermediate domains, travelavailable.com and bfast.com, and eventually entered expedia.com with a URL that indicates an affiliate ID “affcid=41591555”. Another related spam doorway URL, http://hometown.aol.com/kliktop/Asia-TRAVEL.html, exhibited the same behavior.

As a second example, several adult-content websites have affiliate programs and a handful of spammers appear to participate in multiple such programs by redirecting doorway traffic through an intermediate “rotator” domain that rotates the final destination among them. These spammers use a “keyword blanket” that contains a large number of unrelated keywords to allow their adult links to appear in search results of non-adult queries that consist of less common combinations of keywords. Examples include http://www.booblepoo.org/ (which redirected to http://rotator.leckilecki.com/Esc.php, which in turn redirected to http://www.hornymatches.com/65ec64fd/3/), and http://www.teen-dating.biz/ (which rotated through http://www.rotator.chippie.biz/My_Rotator.php to land on http://www.hornymatches.com/21d80ed5/cf774807/ or one of a few other sites). The numbers in the two final-destination URLs appear to encode the affiliate IDs.

3. The Strider Search Ranger System

Ideally, we would like the first-line-of-defense search ranking algorithms to proactively demote spam pages so that none of them ever shows up in top search results. In practice, search spam is an arms race in which spammers are always discovering algorithmic weaknesses that they can exploit to promote their spam pages. Therefore, it is essential for search engines to have a complementary, reactive anti-spam system, like the Strider Search Ranger system depicted in Figure 1, to provide the last line of defense.

[Figure 1 components: Spammer-Targeted Keyword Collector (3.1.1); SearchMonkeys: Search, Scan, and Analyze (3.1.2); Spam Verifier (3.1.3); Spammed Forum Hunter (3.2.1); Relevance Ranking Algorithm, with Content-Based Ranking and Link-Based Ranking (3.2.2); Targeted Patrol & Hunting (3.2.3). Data exchanged between components: Spammer-Targeted Keywords, Grouped Spam-URL Suspects, Confirmed Spam URLs, Spammed Forums, Spam-URL Suspects, Spam-Heavy Domains, Spam-Heavy Lists, and Known-bad Signatures (Spammer Redirection Domains).]

Figure 1: Self-monitoring subsystem (dashed lines) watches for new spam patterns; self-protection subsystem (dotted lines) strengthens relevance ranking algorithms and hunts for more related spam URLs. The numbers indicate the section numbers in this paper.
The system consists of two major subsystems: the dashed-line boxes in Figure 1 form the self-monitoring subsystem, and the dotted-line boxes belong to the self-protection subsystem. The oval boxes belong to the relevance ranking algorithms that are outside Search Ranger but take the information marked by gray-color circles as input. The system starts with SearchMonkeys scanning search results of spammer-targeted keywords and providing a prioritized list of groups of spam-URL suspects to the Spam Verifier, which attempts to gather evidence of common spamming activities to confirm the spam status. Confirmed spam URLs then drive the Spammed-Forum Hunter to perform a backward discovery of their supporting forums, from which new spammer-targeted keywords are extracted to update the self-monitoring subsystem and new spam-URL suspects are extracted to drive the Targeted Patrol & Hunting component.

We define the following terms that will be used throughout the remainder of this paper:

Link-query: a query of “link:http://foo.com/bar” returns a list of pages that contain links to http://foo.com/bar.

Site-query: a query of “site:foo.com bar” returns a list of pages hosted on foo.com that contain the keyword “bar”.

3.1. Self-Monitoring Subsystem

3.1.1. Spammer-Targeted Keyword Collector

Large-scale spammers are in business to make money, so they are not interested in search terms that have no commercial value. Furthermore, some of the commerce queries already have a long list of well-established websites permanently occupying top search results; these are much harder for spammers to penetrate and so are less attractive to them. These observations motivated us to “follow the money” by discovering and accumulating spammer-targeted keywords, and heavily monitoring their top search results in which the spammers must reveal their presence in order to make money.
We currently collect spammer-targeted keywords from five sources:

a) Anchor text associated with comment-spammed links from spammed forums (see Section 3.2.1); these keywords are ranked by their total number of appearances in spammed forums [8];
b) Hyphenated keywords from spam-URL names: e.g., “fluorescent lighting” from http://www.backupyourbirthcontrol.org/fluorescent-lighting.dhtml;
c) Most-bid keywords from a legitimate ads syndication program that have been spammed;
d) Customer complaints about heavily spammed keywords;
e) Honeypot search terms that are rarely queried by regular users but heavily squatted by spammers who are thirsty for any kind of traffic: these include typos such as “gucci handbg”, “cheep ticket”, etc. and unusual combinations of keywords from keyword-blanket pages as discussed in Section 2.

3.1.2. SearchMonkeys: Search, Scan, and Analyze

Given a set of search terms, SearchMonkeys perform a query for each to retrieve the top-N search results, compile a list of unique URLs, and launch a full-fledged browser to visit each URL (which we call the “primary URL”) to ensure that spam analysis is applied to the traffic that leads to the true user-seen content. All resulting HTTP traffic is recorded at the network level, including the header and content of each request and response. Each URL is represented by this set of traffic data, the number of its appearances in all the search results, and additional pieces of information to be described shortly. We then perform similarity analysis based on all the information to detect correlations among search results that indicate likely spam and to produce a prioritized list of groups of spam-URL suspects for the Spam Verifier. Based on an extensive study of hundreds of thousands of spam pages, we found that the following five types of similarity analyses are useful for catching large-scale spammers:

(1) Redirection domains: Grouping based on individual redirection domains is the simplest form of similarity analysis: at the end of each batched scan, primary URLs are grouped under each third-party domain they generated redirection traffic to. Top third-party domains based on the group size – excluding well-known legitimate ads syndicator domains, web-analytics domains, and previously known spammer domains – are highlighted as top suspects for spam investigation. Many large-scale spammers create thousands of doorway pages that redirect to a single domain, and we have used this grouping analysis to identify thousands of spammer redirection domains.

In most cases, one particular redirection domain – the final-destination domain – is most useful in identifying spammers. The final-destination domain refers to the one that appears in the browser address bar when all redirection activities finish. Experimental results in Section 4 will show that, among all web pages that eventually land on third-party final-destination domains, a high percentage of them are spam.

To capture spammers who use a large frame to pull in third-party content without changing the address bar, we include a “most-visible frame” analysis that identifies the frame with the largest size and highlights the redirection domain that is responsible for supplying the content for that frame. For example, the main content for the spam page http://rme21-california-travel.blogspot.com comes from the third-party domain results-today.com, even though the address bar stays on blogspot.com. Our analysis is able to identify results-today.com as the major redirection domain associated with the most-visible frame.

Another special form of redirection-based grouping is domain-specific client-ID grouping. This is made possible by ads syndicators and affiliate-program providers who embed their client-IDs as part of the redirection URLs for accounting purposes. For example, these two spam doorway URLs belong to the same Google AdSense customer with ID “ca-pub-4278413812249404”: http://mywebpage.netscape.com/todaysringtones/freenokia-ringtones/ and http://www.blogcharm.com/motorolaringtones/. The two URLs under http://hometown.aol.com/kliktop that were discussed in Section 2.2 and associated with the same Expedia “affcid=41591555” are another good example.

We fully expect that redirection spammers will start obfuscating their redirection patterns once all major search engines employ the above straightforward and effective redirection analysis. Table 1 gives an example of how a more sophisticated redirection-similarity analysis can be applied to identifying spam pages that do not use third-party redirections. Although the spam URL in (a) redirects to http://yahooooo.info/ while the one in (b) displays content only from its own domain, the two redirection patterns suggest that they are most likely associated with the same spammer (or spam software) because the corresponding GIF files all have exactly the same size.

Table 1: A redirection spam page sharing a similar traffic pattern with a non-redirection spam page

(a) Primary URL: http://bed-12345.tripod.com/page78book.html
    Redirection URLs (under http://yahooooo.info/)         Size
    images/klikvipsearchfooter.gif                          768
    images/klikvipsearchsearchbarright.gif                  820
    images/klikvipsearchsearchbarleft.gif                   834
    images/klikvipsearchheader.gif                        2,396
    images/klikvipsearchsearchbar.gif                     9,862

(b) Primary URL: http://tramadol-cheap.fryeart.org/
    Redirection URLs (under http://tramadol-cheap.fryeart.org)   Size
    images/crpefooter.gif                                   768
    images/crpesearchbarright.gif                           820
    images/crpesearchbarleft.gif                            834
    images/crpeheader.gif                                 2,396
    images/crpesearchbar.gif                              9,862

Generalizing the analysis further, we apply the similarity measure proposed by Sun et al. [9] to our problem: each URL is modeled by the number of HTTP response objects and their sizes (i.e., a multiset of response object lengths). The similarity between two URLs is measured by Jaccard’s coefficient [10]: the size of the intersection (i.e., minimum number of repetitions) divided by the size of the union (i.e., maximum number of repetitions) [11].

(2) Final user-seen page content: this could come directly from the content of one of the HTTP responses or it could be dynamically generated by browser-executed scripts. Content signatures from this page could be used for grouping purposes. For example, the ads links shown on these two pages share the same aggregator domain name (in IP-address form) – 66.230.138.211:

    http://free-porn.IEEEPCS.org/
    http://HistMed.org/free-porn.phtml

(3) Domain WhoIs and IP address information: for many of the largest-scale spammers, IP address- or subnet-based grouping is most effective; for example, these two different-looking blog pages – http://urch.ogymy.info/ and http://anel.banakup.org/ – are hosted on the same IP address 72.232.234.154. Similarly, it is not uncommon to see multiple redirection domains share the same IP address/subnet or WhoIs registrant name. For example, these two redirection domains share the same IP address 216.255.181.107 but are registered under different names (“->” means “redirects to”):

    http://hometown.aol.com/seostore/discount-coach-purse.html (footnote 2) -> topsearch10.net
    http://www.beepworld.de/memberdateien/members101/satin7374/buy-propecia.html -> drugse.com

As another example, these two redirection domains share the same registrant name, but are hosted on two different IP addresses:

    http://9uvarletotr.proboards51.com/ -> paysefeed.net
    http://ehome.compuserve.de/copalace15/tracfoneringtones.html -> arearate.com

Footnote 2: This spam URL uses click-through cloaking [6]. The redirection to topsearch10.net can only be seen by clicking through a search result.

(4) Link-query results: these provide two possibilities for similarity grouping: (i) link-query result similarity: for example, the top-10 link-query results for http://ritalin.r8.org/ and http://ritalin.ne1.net/ overlap 100% and have a 50% overlap with those for the beepworld.de spam URL mentioned above. This suggests that they are very likely comment-spammed by the same spammer; (ii) when a pair of spam URLs appear in each other’s link-query results, it is a strong indication that they belong to the same link farm.

(5) Click-through analysis: when spam pages are identified as potential ads-portal pages through simple heuristics, redirection traffic following an ad click could be useful for grouping. For example, these two ads-portal pages look quite different:

    http://cheap-air-ticketoo.blogspot.com/
    http://hometown.aol.com/discountz4you/cheap-hotelticket.html

But if you click the Orbitz ad on either page, it would generate a redirection chain that includes the same syndicator domain looksmart.com.

3.1.3. Spam Verifier

For each group of spam-URL suspects produced by the SearchMonkeys, the spam verifier extracts a sample set, performs seven types of analyses on each URL to gather evidence of spamming behavior, and computes an average spam score for the group. Groups with scores above a threshold are classified as confirmed spam. As shown in Figure 1, confirmed spam URLs are submitted to the spammed-forum hunter and the keyword collector, while confirmed spammer redirection domains are added to the known-bad signatures used by both the targeted patrol & hunting component and SearchMonkeys. The remaining unconfirmed groups are submitted to human judges for manual investigation. The seven types of spam analyses are:

1. whether the page redirects to a new domain residing on an IP address known to host at least one known spammer redirection domain;
2. whether its link-query results contain a large number of forums, guest books, message boards, etc. that are known to be popular among comment spammers;
3. whether the page uses scripting-on/off cloaking to achieve crawler-browser cloaking, by comparing two vectors of redirection domains from visiting the same page twice – once with regular browser settings and the second time with browser scripting turned off;
4. whether the page uses click-through cloaking [6], by comparing the two redirection vectors from a search-result click-through visit and a direct visit, respectively;
5. whether the page is an ads-portal page that forwards ads click-through traffic to known spam-traffic aggregator domains;
6. whether the page is hosted on a known spam-heavy domain;
7. whether a set of images or scripts match known-spam filenames, sizes, or content hashes.

In addition, the spam verifier produces the following three ranked lists to help prioritize targeted self-protection: (1) domains ranked by the number of unique confirmed spam URLs they host; (2) domains ranked by their number of spam appearances in the search results; (3) domains ranked by their spam percentages – number of unique spam URLs divided by number of all unique URLs that appear in the search results.

3.2. Self-Protection Subsystem

Taking spam analysis results from the self-monitoring subsystem, the self-protection subsystem is responsible for taking actions to defend the search engine against detected spammers, including a long-term solution to strengthen the ranking algorithms and a short-term solution to clean up existing damage.

3.2.1. Spammed-Forum Hunter

Taking confirmed spam URLs as input, the spammed-forum hunter performs a link-query for each URL to collect spammed forums that have successfully promoted the spam URL into top search results. It extracts all third-party URLs that appear in such forums, and feeds them to the targeted patrol & hunting component. It also extracts all their associated anchor text (which consists of the keywords that the spammers want search engines to index and are usually the search terms they are targeting) and feeds them to the keyword collector.

3.2.2. Strengthening Relevance Ranking Algorithms

In order for a spam link to appear in top search results, the content- and link-based relevance ranking algorithms must have made a mistake. In other words, the spammers must have successfully reverse-engineered the algorithms and crafted their pages and links in a way that fooled the algorithms. To help identify the weakness exploited by the spammers, three pieces of information from spam analysis (see the three gray circles in Figure 1) are provided to the ranking algorithms:

• Confirmed spam URLs, spammer-targeted keywords, and spam-heavy domains are provided to the link-based ranking algorithms to better train the classifier that is responsible for identifying web pages whose outgoing links should be discounted. In particular, web forums that contain multiple spammer-targeted keywords across unrelated categories (such as drugs, handbags, and mp3) and multiple URLs from different spam-heavy domains are very likely spammed forums that should be flagged. When many spammed forums share an uncommon URL substring, it is often an indication that a new type of forum has been discovered by spammers.

• Confirmed spam URLs are also provided to the content-based ranking algorithms. If they are cloaked pages, their indexed fake pages should be excluded from the ranking to avoid “contaminating” the relevance evaluation of good pages; if they are not cloaked pages, their content should be analyzed for new content-spamming tricks. In either case, the spam pages can be used to extract more spammer-targeted keywords to feed the SearchMonkeys.

3.2.3. Targeted Patrol & Hunting Component

The observation that a large-scale spammer has successfully spammed the keywords that SearchMonkeys are monitoring is often an indication that they are systematically attacking a weakness in the ranking algorithms and most likely have penetrated many other unmonitored keywords as well. The targeted patrol & hunting component is responsible for hunting down as many spam pages belonging to the same spammer as possible so that damage can be controlled in the short term while longer-term solutions that involve ranking algorithm modifications are being developed. Since the unmonitored set is much larger than the monitored set, this component relies on the following “spam-heavy lists” to prioritize the use of machine resources:

• Targeted patrol of search results hosted on spam-heavy domains: this component addresses the following practical issue: suppose the self-monitoring part can only afford to cover N search terms and we have approximately the same number of machines for the self-protection part but we would like to clean up, say, 10xN search terms. How should we prioritize the search results to scan? Our current approach is to use the lists of top spam-heavy domains and top spam-percentage domains produced by the spam verifier to filter the larger set of search results for targeted patrol. See Section 4.2.1 for an evaluation of this approach.

• Targeted hunting of spam suspects from spammed forums: once a set of spammed forums are identified as the culprits for successfully promoting some spam URLs, the other URLs that appear in the same forums are very likely enjoying the same bogus link support and have a better chance of getting into top search results. See Section 4.2.2 for a case study of effective spam hunting through targeted scanning of such spam suspect lists.

• Malicious spam URLs: we have observed that some malicious website operators that exploit browser vulnerabilities to install malware are also using search spam techniques to get access to more client machines. The Strider Search Ranger system is connected to the Strider HoneyMonkey system, which uses Virtual Machine-based, active client-side honeypots to detect malicious websites [12]. Spam URLs detected by Search Ranger are submitted to HoneyMonkey to check for malicious activities. Once a spam URL is determined to be malicious, it is immediately removed from the index to protect search users, and spam suspects from its link-query results are scanned with high priority. Section 4.2.3 describes an actual example of a malicious URL occupying a large number of top search results at a major search engine.

4. Experimental Evaluations

In this section, we present results from experiments that focused on individual subsystems and subsets of keywords to better illustrate the effectiveness of our approach. We use a list of spammer-targeted keywords that we previously constructed [8] for most of the evaluations presented in this section. Starting with a list of 4,803 confirmed spam URLs, we used link-queries to obtain 35,878 spammed forums, from which we collected 1,132,099 unique anchor-text keywords with a total of 6,026,699 occurrences.
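The counting behind these figures can be pictured with a minimal sketch. It assumes the link-query crawl has already been reduced to (forum URL, anchor text) pairs; that input shape, the function name, and the whitespace/case normalization are illustrative assumptions, not details given in the paper.

```python
from collections import Counter


def count_anchor_keywords(forum_links):
    """Count how often each anchor-text keyword occurs across spammed forums.

    forum_links: iterable of (forum_url, anchor_text) pairs. This input shape
    is a hypothetical simplification of the crawled forum data.
    """
    counts = Counter()
    for _forum_url, anchor_text in forum_links:
        # Assumed normalization: lower-case and collapse runs of whitespace.
        keyword = " ".join(anchor_text.lower().split())
        if keyword:
            counts[keyword] += 1
    return counts


def rank_keywords(counts):
    """Return (keyword, occurrences) pairs, most frequent first."""
    return counts.most_common()


links = [
    ("http://forum-a.example/guestbook", "Cheap Ticket"),
    ("http://forum-b.example/board", "cheap  ticket"),
    ("http://forum-a.example/guestbook", "gucci handbag"),
    ("http://forum-c.example/blog", "   "),
]
counts = count_anchor_keywords(links)
ranked = rank_keywords(counts)
```

Here `len(counts)` plays the role of the number of unique keywords and `sum(counts.values())` the total number of occurrences.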
We then ranked the keywords by their occurrence counts to produce a sorted list. To minimize human investigation effort, we focused only on spam pages that redirected to third-party final destinations in all the experiments. Unless mentioned otherwise, we obtained the top-20 results for each search query.

4.1. Evaluations of Self-Monitoring Subsystem

We present data from three experiments to illustrate the practical use of the self-monitoring subsystem: Section 4.1.1 presents basic final destination-based grouping for identifying large-scale spammers who successfully spammed Google and Yahoo; Section 4.1.2 presents a similar analysis for the typo version to evaluate the use of search-term typos as honeypots; Section 4.1.3 presents grouping analyses based on intermediate redirection domains and IP addresses for catching traffic-affiliate spammers who successfully spammed Live Search.

4.1.1. Final-Destination and Doorway Analysis

For our first experiment, we selected the top-100 handbag-related keywords from the spammer-targeted list and used SearchMonkeys to scan and analyze search results from both Google and Yahoo. Table 2 compares the percentage of unique URLs among all URLs (row [a]), the percentage among unique URLs that landed on third-party final destinations (row [b]), and the percentage of those URLs that were determined to be spam (row [c]). Table 3 summarizes the redirection analyses for all detected spam URLs, where the first column is the list of final-destination domains sorted by the number of search-result appearances that redirected to them (the 4th column) and the second column contains the list of doorway domains for each final-destination domain, sorted by the number of appearances (the 3rd column).

Table 2: Top-100 Handbag Keywords (“3rd” means “landing on third-party final destinations”)

                   Google       Yahoo        Google    Yahoo
                   non-typos    non-typos    typos     typos
[a] % uniq/all     55%          79%          39%       50%
[b] % uniq 3rd     1.8%         7.1%         47%       50%
[c] % spam 3rd     80%          85%          99%       79%

Table 3: Redirection Analyses of Top-100 Handbag Keywords (Non-Typo)

(a) Google results

Final-destination domain    Doorway domain              # app. / # uniq   # app. / # uniq
                                                        (doorway)         (final destination)
#1: lyarva.com              forumfactory.com            21 / 1            44 (2.2%) / 4
                            onlinewebservice6.de        20 / 1
                            foren.cx                    2 / 1
                            page.tl                     1 / 1
#2: biopharmasite.info      blogigo.de                  32 / 1            32 (1.6%) / 1
                            (malicious URL: blogigo.de/handbag)
#3: topsearch10.com         forumfactory.com            8 / 1             18 (0.9%) / 8
                            blogspot.com                6 / 6
                            kostenloses-forum.be        4 / 1
topmeds10.com               blogspot.com                1 / 1             1
kikos.info                  blogspot.com                1 / 1             1
hachiksearch.com            asv-basketball.org          1 / 1             1
Total                       *                           *                 97 (4.9%) / 16

(b) Yahoo results

Final-destination domain    Doorway domain              # app. / # uniq   # app. / # uniq
                                                        (doorway)         (final destination)
#1: shajoo.com              onlinehome.us               16 / 16           50 (2.5%) / 50
                            freett.com                  15 / 15
                            eccentrix.com               5 / 5
                            lacomunitat.net             4 / 4
                            5 misc. domains             10 / 10
#2: findover.org            hometown.aol.com            30 / 17           30 (1.5%) / 17
#3: topsearch10.com         geocities.com               2 / 2             10 (0.5%) / 7
                            blogspot.com                4 / 1
                            hometown.aol.com            1 / 1
                            3 misc. domains             3 / 3
maximumsearch.net           4 misc. domains             9 / 4             9 / 4
myqpu.com                   freett.com                  5 / 5             5 / 5
nashtop.info                toplog.nl                   4 / 4             4 / 4
fast-info.org               mywebpage.netscape.com      4 / 4             4 / 4
searchadv.com               2 misc. domains             2 / 2             2 / 2
results-today.com           xoomer.virgilio.it          1 / 1             1 / 1
filldirect.com              opooch.com                  1 / 1             1 / 1
Total                       *                           *                 116 (5.8%) / 95

First, we observe several similarities between the Google and Yahoo data. Table 2 shows that both of them had only a small percentage of unique search-result URLs (1.8% and 7.1%) that redirected to third-party final destinations, and a large percentage of those URLs (80% and 85%) were determined to be spam. Table 3 shows that both of them had a non-trivial percentage of spam results (4.9% and 5.8% as shown in the last row) that could be detected by Search Ranger and both were spammed by large-scale spammers: the top-3 final-destination domains alone were responsible for 4.7% and 4.5% spam densities.

The two sets of data also exhibit significant differences. Overall, Google had a lower percentage of unique URLs (55% versus 79%) and its spam URLs followed a similar pattern: they had an average appearance count of 97 / 16 = 6.1, which is much higher than Yahoo’s number of 116 / 95 = 1.2. More significantly, its top-3 spam URLs had 32, 21, and 20 appearances, respectively, which are much higher than Yahoo’s maximum per-URL appearance count of four. The two lists of final-destination domains and doorway domains also differ significantly (except for the most ubiquitous topsearch10.com and blogspot.com). This demonstrates the importance of search engine-specific self-monitoring to detect new large-scale spammers who have started to defeat the engine's current anti-spam solution, so that proper, engine-specific defense mechanisms can be developed and deployed at an earlier stage to prevent large-scale damage to search quality.

4.1.3. Traffic-Affiliate Analysis

To detect the keyword-blanket spammers mentioned in Section 2.2, we manually³ constructed 10 search terms using unusual combinations of keywords from a keyword blanket page, for example, “George Bush mset machine learning”. We also received another three search terms through internal customer complaints. We issued these 13 queries at Live Search and retrieved all search results that the engine was willing to return. In total, we obtained 6,300 unique URLs, among which 1,115 (18%) landed on the two largest adult affiliate program providers: 905 to hornymatches.com and 210 to adultfriendfinder.com. Among the 1,115 spam URLs, Table 4 shows the intermediate rotators and the number of doorways associated with each.
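The redirection-domain grouping used in this section (by final destination for Table 3, by intermediate rotator for Table 4) can be sketched roughly as follows. This is a minimal illustration, assuming each scan record is a (doorway URL, redirection domain) pair already extracted by the scanner. The record layout, the function names `group_by_redirection` and `rank_groups`, the use of the full hostname as the doorway domain, and the sample URLs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def group_by_redirection(records):
    """Group search-result appearances by redirection domain.

    records: iterable of (doorway_url, redirection_domain), one per appearance.
    Returns {redirection_domain: {doorway_host: (appearances, unique_urls)}}.
    Using the full hostname as the "doorway domain" is a simplification.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for doorway_url, redirection_domain in records:
        doorway_host = urlsplit(doorway_url).hostname or ""
        seen[redirection_domain][doorway_host].append(doorway_url)
    return {
        dest: {host: (len(urls), len(set(urls))) for host, urls in hosts.items()}
        for dest, hosts in seen.items()
    }

def rank_groups(groups):
    """Sort redirection domains by total appearances, largest first."""
    totals = {dest: sum(app for app, _ in hosts.values())
              for dest, hosts in groups.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Made-up scan records; the .test domains are placeholders.
records = [
    ("http://a.example-blog.test/handbag", "rotator-one.test"),
    ("http://a.example-blog.test/handbag", "rotator-one.test"),
    ("http://b.example-blog.test/watch", "rotator-one.test"),
    ("http://c.example-host.test/mp3", "rotator-two.test"),
]
groups = group_by_redirection(records)
ranking = rank_groups(groups)
```

A redirection domain that sits behind many doorways or many appearances rises to the top of `ranking`, which is the signal the grouping analysis looks for.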
Domains #1, #4, and #5 were previously unknown spammer redirection domains at that time and have since been added to the signature set. Some of these spammers had successfully spammed Yahoo search results as well. Table 5 shows an alternative grouping analysis based on the IP addresses of doorways, which revealed two additional pieces of information: the close relationship between chippie.biz and tabaiba.biz, and the spam-heavy subnet that contained multiple IP addresses redirecting to w3gay.com.

Table 4: Adult traffic affiliates and doorways

Adult rotators                                # doorways
#1 rotator.leckilecki.com/Esc.php             686
#2 borg.w3gay.com/                            206
#3 www.rotator.chippie.biz/My_Rotator.php     90
#4 www.rotator.tabaiba.biz/My_Rotator.php     80
#5 www.rotator.pulpito.biz/My_Rotator.php     40
#6 rotator.siam-data.com/Esc.php              10
#7 rotator.wukku.com/Esc.php                  3
Total                                         1,115

Table 5: Adult traffic affiliates and IP addresses

IP address        # doorways   Rotator association
209.85.15.38      686          leckilecki.com
207.44.142.129    170          chippie.biz (90) & tabaiba.biz (80)
70.86.247.37      145          w3gay.com (part 1 of 206)
70.86.247.38      36           w3gay.com (part 2 of 206)
70.86.247.34      25           w3gay.com (part 3 of 206)
74.52.19.162      40           pulpito.biz
69.93.222.98      10           siam-data.com
207.44.234.82     3            wukku.com
Total             1,115        *

³ We are currently working on automating the keyword construction process by searching, scanning, and analyzing random combinations or applying information retrieval and natural language techniques to pick search terms that have very few good matches.

4.1.2. Search Term Typo-Patrol

To study the effect of search-term typos on redirection analysis, we replaced the word “handbag” with “handbg” in the list of keywords used in Section 4.1.1 and rescanned the Google and Yahoo search results. The numbers are summarized in the last two columns of Table 2. Compared to their non-typo counterparts, both sets had a significant drop in the percentage of unique URLs and an even more significant increase in the percentage of URLs that landed on third-party final destinations, among which the spam densities remained very high. This confirms that search-term typos are effective honeypots for obtaining search results with a high spam density.

Among the 357 unique spam URLs in the Google data, the lesser-known blog site alkablog.com was the top doorway domain responsible for 277 (78%) of the spam URLs, followed by the distant #2 kaosblog.com (15) and #3 creablog.com (13), ahead of the #4 blogspot.com (12). This provides good intelligence on new doorway domains that are becoming spam-heavy and should be scrutinized.

4.2. Evaluations of Self-Protection Subsystem

We present data from three experiments to illustrate the practical use of the self-protection subsystem: Section 4.2.1 evaluates the effectiveness of targeted patrol of spam-heavy domains; Section 4.2.2 presents a case study of link-query spam hunting; Section 4.2.3 describes an example of malicious-URL spam hunting.

4.2.1. Targeted Patrol of Search Results

To evaluate the effectiveness of targeted patrol of spam-heavy domains, we first scanned and analyzed the top-20 Live Search results of the top-1000 keywords from the spammer-targeted list, and derived a total of 89 spam-heavy domains by taking the union of the top-10 domains in terms of the number of hosted spam URLs, the top-10 domains in terms of the number of spam appearances, and all domains that had a higher than 50% spam percentage (see Section 3.1.3). We then performed a “horizontal” targeted patrol of the top-20 results for the next 10,000 spammer-targeted keywords, and a “vertical” targeted patrol of the 21st to 40th search results for the entire 11,000 keywords. The latter is to minimize the chance that vacated top-20 positions are filled with spam again.

Table 6 shows that the 89-domain filter selected approximately 10% (9.8% and 9.0%) of the URLs for targeted patrol.
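The selection rule for spam-heavy domains and the resulting patrol filter might look like the following sketch. It assumes that per-domain statistics have already been computed from a prior scan. The `DomainStats` record, the function names, and the reading of “spam percentage” as spam URLs divided by all URLs seen on the domain are assumptions made for illustration; Section 3.1.3 gives the actual definition.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

@dataclass(frozen=True)
class DomainStats:          # hypothetical record, not from the paper
    domain: str
    spam_urls: int          # unique spam URLs hosted on the domain
    spam_appearances: int   # search-result appearances of those URLs
    total_urls: int         # all unique URLs seen on the domain

def spam_heavy_domains(stats, top_n=10, min_spam_pct=0.5):
    """Union of top-N by spam URLs, top-N by spam appearances, and
    all domains whose (assumed) spam percentage exceeds min_spam_pct."""
    spammed = [s for s in stats if s.spam_urls > 0]
    by_urls = sorted(spammed, key=lambda s: s.spam_urls, reverse=True)[:top_n]
    by_apps = sorted(spammed, key=lambda s: s.spam_appearances, reverse=True)[:top_n]
    by_pct = [s for s in spammed
              if s.total_urls and s.spam_urls / s.total_urls > min_spam_pct]
    return {s.domain for s in by_urls + by_apps + by_pct}

def select_for_patrol(result_urls, heavy_domains):
    """Keep URLs whose host is, or is a subdomain of, a spam-heavy domain."""
    selected = []
    for url in result_urls:
        host = urlsplit(url).hostname or ""
        if any(host == d or host.endswith("." + d) for d in heavy_domains):
            selected.append(url)
    return selected

# Made-up statistics; top_n=1 keeps the example small.
stats = [
    DomainStats("blogs.test", 40, 55, 200),
    DomainStats("pages.test", 10, 90, 500),
    DomainStats("tiny.test", 3, 3, 4),
    DomainStats("clean.test", 1, 1, 1000),
]
heavy = spam_heavy_domains(stats, top_n=1)
patrol = select_for_patrol(
    ["http://user1.blogs.test/a", "http://clean.test/b",
     "http://tiny.test/c", "http://notblogs.test/x"],
    heavy,
)
```

Only the URLs in `patrol` would then be handed to the scanner, which is what keeps the targeted patrol inexpensive relative to scanning every search result.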
Among them, a high percentage (68% and 65%) redirected to third-party final destinations, which indicates that this is most likely a spam-heavy list. Search Ranger analysis confirmed that more than 3 out of every 4 of these redirecting URLs (78% and 76%) were spam. It also confirmed that targeted patrol of spam-heavy domains is productive: approximately half (53% and 49%) of the URLs selected based on spam-heavy domains were spam.

Table 6: Horizontal and vertical targeted patrol

                                            Top-20 results of        Next 20 results of
                                            next 10,000 keywords     all 11,000 keywords
[a] # unique URLs                           141,442                  165,448
[b] # unique URLs on spam-heavy domains     13,846 (9.8% of [a])     14,867 (9.0% of [a])
[c] # unique 3rd                            9,395 (68% of [b])       9,649 (65% of [b])
[d] # spam URLs                             7,339 (53% of [b],       7,345 (49% of [b],
                                            78% of [c])              76% of [c])

4.2.2. Targeted Hunting of Related Spam

For the second experiment, we selected the top-100 travel-related search terms from the spammer-targeted list, and scanned the search results from Live Search. Among the 105 confirmed spam URLs, the top redirection domain searchlab.info was behind 30 doorway URLs – 10 were hosted on geocities.com and 20 on angelfire.com. We use these 30 URLs as seeds to evaluate the cross-category impact of targeted hunting.

Link-query results for the 10 geocities.com URLs produced a list of 1,943 spam suspects hosted on the same domain. We extracted 631 URLs that contained at least one of the following four keywords observed to be popular on the list: “parts”, “handbag”, “watch”, and “replica”. With a few exceptions of no-longer-valid URLs, all of the 631 URLs redirected to searchlab.info. We then derived 630 unique search keywords from these URL names and queried Live Search for top-20 results. We found that 138 URLs had a total of 161 appearances in the top-20 results of 142 unique search terms. In particular, 50 appearances were among the top-3 search results for queries like “expensive watches”, “free webcams to watch”, “volkswagon parts”, “designer handbag replicas”, etc. The middle column of Table 7 summarizes the results.

We conducted a similar experiment on the 20 angelfire.com spam URLs except that this time we filtered out spam suspects whose URL names contain travel-related keywords or any of the previous four keywords. We collected a total of 1,710 angelfire.com spam suspects, from which we derived 1,598 unique search terms in diverse categories such as “free maid of honor speeches”, “daybed mattress discount”, “cheap viagra uk”, “free pamela anderson video clip”, etc. Again, searchlab.info was behind almost all of the still-active spam URLs. The right-most column of Table 7 summarizes the results: 134 unique URLs had 229 spam appearances in the top-20 search results of 172 keywords, including 102 top-3 appearances.

These two sets of data confirm that, if a large-scale spammer is observed to have successfully spammed a set of keywords at a search engine, it is very likely that it has also spammed other keywords, and that link-query spam hunting is an effective way to discover those spam URLs.

Table 7: Link-query Spam Hunt Cross-category Effect

                             Geocities.com   Angelfire.com
# suspect URLs               631             1,710
# keywords                   630             1,598
# unique URLs in top-20      138             134
# top-20 appearances         161             229
# top-3 appearances          50              102
# spammed keywords           142             172

4.2.3. Malicious-Spam Hunting

HoneyMonkey scanning of the 16 Google spam URLs detected in Section 4.1.1 revealed that the one with the highest number of per-URL appearances (32) was actually a malicious URL: http:// blogigo.de / handbag (spaces added for safety). By performing 100 queries of “site:blogigo.de” in conjunction with the 100 handbag keywords, we obtained 27 URLs for HoneyMonkey scanning and identified an additional malicious URL: http:// blogigo.de / handbag / handbag / 1. Through a link-query spam hunt, we discovered 14 blogigo.de spam suspects and identified another malicious URL: http:// blogigo.de / pain killer, which appeared as the #7 Google search result for “pain killer”.

This example demonstrates that, when a search engine makes a mistake and assigns a high ranking to a malicious spam URL, the damage to search-result quality and the potential damage to visitors’ machines could be widespread. It is therefore extremely important for major search engines to monitor for malicious search results and remove them as early as possible to protect search users.

5. Related Work

Gyongyi and Garcia-Molina identified cloaking and redirection as two techniques for hiding spam content [2]. Wu and Davison proposed an automated method to detect semantic cloaking, which first identifies suspect pages based on the content of the pages returned to a browser and a crawler, and then uses machine learning to create a classifier [13]. Chellapilla and Chickering investigated cloaking from an economic perspective by comparing search results from the top 5,000 queries and the top 5,000 monetizable queries [4].

Benczur et al. presented SpamRank as an automated spam detection technique by identifying pages that violated the power law distribution by linking to one another [14]. They observed that link similarity measures could be more effective than trust/distrust measures in classifying spam pages. Similarly, Carvalho et al. focused on identifying “noisy” links, which are sites with abnormal support between each other, by measuring the amount of linking between two sites [15]. Many have proposed content-based spam detection techniques [16,17]. Our Search Ranger system is unique in that it makes heavy use of redirection-based spam detection in an autonomic anti-spam approach.

6. Summary

Search spam will always remain an arms race as spammers continuously try to reverse-engineer search engines’ ranking algorithms to find weaknesses that they can exploit. By targeting the two invariants that characterize large-scale spammers – they care only about commerce queries that they can monetize and they want to spam many search results of those queries – we hope to give search engines the upper hand by limiting spammers’ success.

We have described the general concept of using similarity-based grouping to monitor search results of spammer-targeted keywords, and demonstrated the concept by using the current implementation of the Strider Search Ranger system to defend against large-scale redirection spammers, who have successfully spammed all three major search engines. We have shown through experimental evaluations that a straightforward redirection domain-based grouping is effective today in identifying new spammers, especially those that redirect their spam pages to third-party final destinations. Two sets of data from Table 2 and Table 6 show consistent results: spam-heavy lists of URLs tend to have a high percentage (47%~68%) that land on third-party final destinations and, among those URLs, a large percentage (76%~99%) are spam. We have also provided evidence that more sophisticated grouping techniques should be able to detect tomorrow’s spammers, under the assumption that any automated and systematic, large-scale spamming activities must leave traces of patterns that can be discovered.

Our reactive, last-line-of-defense approach complements the proactive, first-line-of-defense relevance ranking algorithms by providing concrete and systematic evidence on the weakness of the algorithms that have been exploited by spammers. The self-protection part of our approach also includes a targeted patrol & hunting component that aims at increasing spammers’ costs by wiping out their investment at the first sign of success of some of their spam. We have shown through experimental evaluations that such targeted actions are often productive and can have immediate positive impact on search quality. In particular, Table 6 showed that targeted scan-candidate selection based on spam-heavy domains produced a list of suspects with a high hit rate (49%~53%) and Table 7 showed that spam hunting targeting a particular spammer allowed effective clean-up across a broad range of keywords.

References

[1] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank Citation Ranking: Bringing Order to the Web,” http://dbpubs.stanford.edu:8090/pub/1999-66, Jan. 29, 1998.
[2] Z. Gyongyi and H. Garcia-Molina, “Web Spam Taxonomy,” in Proc. International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
[3] B. Wu and B. D. Davison, “Cloaking and Redirection: A Preliminary Study,” in Proc. AIRWeb, 2005.
[4] K. Chellapilla and D. M. Chickering, “Improving Cloaking Detection Using Search Query Popularity and Monetizability,” in Proc. AIRWeb, August 2006.
[5] Spam Attack by Website Clones, http://research.microsoft.com/SearchRanger/Spam_Attack_by_Website_Clones.htm.
[6] Y. Niu, Y. M. Wang, H. Chen, M. Ma, and F. Hsu, “A Quantitative Study of Forum Spamming Using Context-based Analysis,” in Proc. Network and Distributed System Security (NDSS) Symposium, 2007.
[7] S. R. White, J. E. Hanson, I. Whalley, D. M. Chess, and J. O. Kephart, “An Architectural Approach to Autonomic Computing,” in Proc. ICAC, May 2004.
[8] Y. M. Wang, M. Ma, Y. Niu, and H. Chen, “Spam Double-Funnel: Connecting Web Spammers with Advertisers,” to appear in Proc. International World Wide Web (WWW) Conference, May 2007.
[9] Q. Sun, D. R. Simon, Y. M. Wang, W. Russell, V. N. Padmanabhan, and L. Qiu, “Statistical Identification of Encrypted Web Browsing Traffic,” in Proc. IEEE Symp. on Security & Privacy, May 2002.
[10] C. J. van Rijsbergen, Information Retrieval, 2nd ed., Butterworths, 1979.
[11] T. H. Haveliwala, A. Gionis, and P. Indyk, “Scalable Techniques for Clustering the Web,” in WebDB (Informal Proceedings), pp. 129-134, 2000.
[12] Y. M. Wang, D. Beck, X. Jiang, R. Roussev, C. Verbowski, S. Chen, and S. King, “Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities,” in Proc. NDSS, February 2006.
[13] B. Wu and B. D. Davison, “Detecting Semantic Cloaking on the Web,” in Proc. WWW Conference, May 2006.
[14] A. Benczur, K. Csalogany, T. Sarlos, and M. Uher, “SpamRank – Fully Automatic Link Spam Detection,” in Proc. AIRWeb, May 2005.
[15] A. L. da Costa Carvalho, P. A. Chirita, E. S. de Moura, P. Calado, and W. Nejdl, “Site Level Noise Removal for Search Engines,” in Proc. WWW Conference, May 2006.
[16] P. Kolari, T. Finin, and A. Joshi, “SVMs for the Blogosphere: Blog Identification and Splog Detection,” in AAAI Spring Symposium on Computational Approaches to Analysing Weblogs, March 2006.
[17] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly, “Detecting Spam Web Pages through Content Analysis,” in Proc. WWW Conference, May 2006.