Strider Search Ranger: Towards an Autonomic Anti-Spam Search Engine
Yi-Min Wang and Ming Ma
Cybersecurity and Systems Management Group
Microsoft Research, Redmond, WA
{ymwang, mingma}@microsoft.com
number of websites artificially linking to each other), insert
hidden links into cloned websites [5], and perform comment
spamming by injecting spam links into the comment fields
of publicly accessible “forums”1 [6].
In this paper, we propose a new approach to combat
large-scale search spam. In contrast with the existing
information retrieval-centric approach of applying content
analysis to all crawler-indexed pages, we model the
problem as a security problem and apply traffic analysis to
true user-seen pages presented by search results of
spammer-targeted commerce queries. By analogy to the
physical world, we do not attempt to separate criminals
from the innocent majority by lining up billions of people
and trying to identify how today’s criminals typically dress
themselves. Instead, we start by gathering intelligence on
where the bad neighborhoods are, put those who are
particularly active in those neighborhoods under
surveillance, capture them as they conduct criminal acts in
the crime scene, interrogate them to collect more
information about their networks, and then hunt down
upstream and downstream miscreants who do business with
them.
Our approach is “autonomic” [7] in that it uses self-monitoring of search results to detect large-scale spammers
who have successfully defeated the current anti-spam
solutions, and then fights back by addressing the specific
weakness in the ranking algorithms and performing targeted
spam patrol and hunting for self-protection. Specifically,
we focus on patrolling a group of search results that are
more likely to have a high density of spam links, and try to
detect anomalies of correlations that indicate the presence
of large-scale spammers who are occupying a large number
of search results. Based on that information, we strengthen
the search ranking algorithms to broadly defend against
similar attacks in the future, and perform targeted hunting
of related spam pages to clean up existing damages.
We have implemented an autonomic anti-spam system
called Strider Search Ranger and tested it against an
important class of search spam – the redirection spam – as
the first demonstration of the new approach. Redirection
Abstract
Search spammers use questionable search engine
optimization techniques to promote their spam links into
top search results. Large-scale spammers target commerce
queries that they can monetize and attempt to spam as
many top search results of those queries as possible. We
model the large-scale search spam problem as that of
defending against correlated attacks on search rankings
across multiple keywords, and propose an autonomic anti-spam approach based on self-monitoring and self-protection. In this new approach, search engines monitor
and correlate their own search results of spammer-targeted
keywords to detect large-scale spam attacks that have
successfully bypassed their current anti-spam solutions.
They then initiate self-protection through targeted patrol of
spam-heavy domains, targeted hunting at the sources of
successful spam, and strengthening of specific weaknesses in
the ranking algorithms. We describe the Strider Search
Ranger system which implements this new approach, and
focus on its use to defend against an important class of
search spam – the redirection spam – as a demonstration
of the general concept. We evaluate the system by testing it
against actual search results and show that it can detect
useful spam patterns and eliminate a significant amount of
spam for all three major search engines.
1. Introduction
Search spammers (or web spammers) refer to those who
use questionable search engine optimization techniques to
promote their links into top search positions they do not
deserve. Search spam has traditionally been modeled as an
information retrieval and relevance ranking problem: the
content of each web page and the hyperlinking relationship
between pages are analyzed to determine page ranking [1],
and the hope is that spam pages will naturally have lower
ranks and so appear after all non-spam pages in search
results.
However, large-scale search spammers have successfully
attacked the relevance-based, anti-spam solutions by
providing both bogus content and bogus links. To provide
bogus content, they use crawler-browser cloaking [2,3,4]
by serving a non-spam page designed to gain a high
relevance score to the crawlers, but displaying spam content
designed to maximize potential profit to end users. To
provide bogus links, they create “link farms” (i.e., large
1
For ease of presentation, throughout the paper, we use the term
“forums” to include all blogs, bulletin boards, message boards,
guest books, web journals, diaries, galleries, archives, etc. that
can be abused by web spammers to promote spam URLs.
spam refers to those spam pages that can be characterized
by the third-party domains they generate redirection traffic
to. It is a particularly challenging problem that is plaguing
all three major search engines because it often involves
large-scale attacks and it typically uses cloaking techniques
to fool content-based analysis.
The use of redirection is becoming essential to a big part
of the search spam business that consists of traffic-affiliate
spammers, who participate directly in merchant websites’
affiliate programs, and syndication-based spammers, who
participate in pay-per-click, advertising syndication
programs and display ads-portal pages. In the affiliate
model, the need for spam pages to redirect to their target
merchant sites is clear. In the syndication model, many
large-scale spammers have moved to the practice of setting
up “throw-away” doorway pages on legitimate websites to
avoid exposing their own domains to blacklisting by search
engines; for example, free blog-hosting sites such as
blogspot.com and free web-hosting sites such as
hometown.aol.com are popular among spammers [8]. Since
spammers do not own these servers, they typically use
client-side scripts to redirect browsers to fetch ads from
redirection domains that they own.
The Strider Search Ranger system combats redirection
spam by monitoring spammer-targeted keywords and
identifying major doorway domains and redirection
domains that have successfully spammed those keywords. It
then directs machine resources accordingly to hunt for more
spam pages that are associated with these successful
spammers and very likely have spammed other keywords as
well. The system also provides concrete feedback on the
weaknesses of the current content- and link-based ranking
algorithms to help strengthen them against spam.
The paper is organized as follows. Section 2 describes a
few actual redirection-spam examples to set the context.
Section 3 presents the overall architecture of the Search
Ranger system and describes the self-monitoring and self-protection subsystems. Section 4 evaluates the effectiveness
of the system by demonstrating how it can actually
improve the search quality of all three major search
engines. Section 5 surveys related work and Section 6
concludes the paper. We note that all spam investigations
and experimental evaluations described in this paper are
based on data collected between November and early
December 2006. Some of the spam pages may no longer be
active or may have changed their behavior.
insulate legitimate advertisers from the spam pages where
their ads appear [8]: the spammers play the role of
publishers by creating low-quality doorway pages that send
visiting browsers to their ads-serving redirection domains.
When a spam ad is clicked, the click-through traffic usually
goes to an anonymous aggregator domain, which funnels
such traffic from a large number of spam pages to a handful
of syndicators, who are responsible for the final redirection
to the target websites owned by the advertisers. It is the
advertisers who ultimately pay for the click-through traffic
and fund the search-spam industry.
We use two examples of successful spam pages to help
readers gain a concrete understanding of syndication-based
spam. Around November 2006, this spam URL appeared in
the top-10 Google search results for “discount chanel
handbag”:
http://hometown.aol.com/m1stnah/chanelhandbags.html. It was a doorway that redirected to the well-known redirection domain topsearch10.com. If the
http://www.shopping.com ad was clicked, the click-through
traffic first went to the aggregator domain 66.230.173.28,
which redirected to the syndicator domain looksmart.com,
which in turn redirected to the advertiser domain
shopping.com.
Similarly, this spam URL from a U.S. government
domain appeared in the top-10 Yahoo search results for
“verizon ringtone”:
http://www.usaid.gov/cgi-bin/goodbye?http://catalog-online.kzn.ru/free/verizon-ringtones/. It took advantage of the Universal Redirector [6]
provided by usaid.gov to redirect to the doorway page at
http://catalog-online.kzn.ru/free/verizon-ringtones/, which
in turn redirected to the well-known redirection domain
paysefeed.net. Clicking the http://usa.funmobile.com ad
would send the browser through the following redirection
chain: the aggregator domain 66.230.182.178, the
syndicator domain findwhat.com, and the advertiser domain
funmobile.com.
2.2. Traffic-Affiliate Spammers
Some merchant websites provide affiliate programs that
pay for traffic drawn to their sites. Many spammers are
abusing such programs by creating and promoting doorway
pages that redirect to these merchant sites. There is a major
difference between traffic-affiliate spam and syndication-based spam: while the latter displays a list of ads and
requires an additional ad-click to initiate the redirection
chain leading to the advertiser’s website, the former directly
brings the browser to the merchant site. As a result, while
the final destination for clicking on a syndication-based
doorway is often a spammer-operated ads-serving domain,
it is the intermediate redirection domains in the traffic-affiliate scenario that are responsible for the spam.
We next describe two examples of traffic-affiliate spam.
Around November 2006, this spam URL appeared in the
top-10 Live Search results for “cheap ticket”:
http://hometown.aol.com/kliktop/Cheap-TICKET.html.
2. Redirection Spam
2.1. Syndication-Based Spammers
A legitimate syndication business is typically composed
of three layers: the publishers who operate high-quality
websites to attract traffic, the advertisers who pay for their
ads to appear on those sites, and the syndicator who
provides the infrastructure to connect them. A spam-heavy
syndication program typically has more layers in order to
Clicking on this doorway link would generate a redirection
chain that went through two intermediate domains
travelavailable.com and bfast.com, and eventually entered
expedia.com with a URL that indicates an affiliate ID
“affcid=41591555”. Another related spam doorway URL
http://hometown.aol.com/kliktop/Asia-TRAVEL.html
exhibited the same behavior.
As a second example, several adult-content websites have
affiliate programs and a handful of spammers appear to
participate in multiple such programs by redirecting
doorway traffic through an intermediate “rotator” domain
that rotates the final destination among them. These
spammers use a “keyword blanket” that contains a large
number of unrelated keywords to allow their adult links to
appear in search results of non-adult queries that consist of
less common combinations of keywords. Examples include
http://www.booblepoo.org/ (which redirected to
http://rotator.leckilecki.com/Esc.php, which in turn
redirected to http://www.hornymatches.com/65ec64fd/3/),
and http://www.teen-dating.biz/ (which rotated through
http://www.rotator.chippie.biz/My_Rotator.php to land on
http://www.hornymatches.com/21d80ed5/cf774807/ or one
of a few other sites). The numbers in the two final-destination URLs appear to encode the affiliate IDs.
3. The Strider Search Ranger System

[Figure 1 diagram. Components: Spammer-Targeted Keyword Collector (3.1.1); SearchMonkeys: Search, Scan, and Analyze (3.1.2); Spam Verifier (3.1.3); Spammed Forum Hunter (3.2.1); Relevance Ranking Algorithm with Content-Based Ranking and Link-Based Ranking (3.2.2); Targeted Patrol & Hunting with Spam-Heavy Lists (3.2.3). Data flows: Spammer-Targeted Keywords; Grouped Spam-URL Suspects; Confirmed Spam URLs; Spammed Forums; Keyword Extraction; URL Extraction; Spam-URL Suspects; Spam-Heavy Domains; Known-bad Signatures: Spammer Redirection Domains.]
Figure 1: Self-monitoring subsystem (dashed lines)
watches for new spam patterns; self-protection subsystem
(dotted lines) strengthens relevance ranking algorithms
and hunts for more related spam URLs. The numbers
indicate the section numbers in this paper.
Ideally, we would like the first-line-of-defense search
ranking algorithms to proactively demote spam pages so
that none of them ever shows up in top search results. In
practice, search spam is an arms race in which spammers
are always discovering algorithmic weaknesses that they
can exploit to promote their spam pages. Therefore, it is
essential for search engines to have a complementary,
reactive anti-spam system, like the Strider Search Ranger
system depicted in Figure 1, to provide the last line of
defense.
The system consists of two major subsystems: the
dashed-line boxes in Figure 1 form the self-monitoring
subsystem, and the dotted-line boxes belong to the self-protection subsystem. The oval boxes belong to the
relevance ranking algorithms that are outside Search Ranger
but take the information marked by gray-color circles as
input.
The system starts with SearchMonkeys scanning search
results of spammer-targeted keywords and providing a
prioritized list of groups of spam-URL suspects to the Spam
Verifier, which attempts to gather evidence of common
spamming activities to confirm the spam status. Confirmed
spam URLs then drive the Spammed-forum Hunter to
perform a backward discovery of their supporting forums,
from which new spammer-targeted keywords are extracted
to update the self-monitoring subsystem and new spam-URL suspects are extracted to drive the Targeted Patrol &
Hunting component.
We define the following terms that will be used
throughout the remainder of this paper:
• Link-query: a query of “link:http://foo.com/bar” returns
a list of pages that contain links to http://foo.com/bar.
• Site-query: a query of “site:foo.com bar” returns a list
of pages hosted on foo.com that contain the keyword
“bar”.
3.1. Self-Monitoring Subsystem
3.1.1. Spammer-Targeted Keyword Collector
Large-scale spammers are in business to make money,
so they are not interested in search terms that do not have
commercial value. Furthermore, some commerce
queries already have a long list of well-established websites
permanently occupying top search results; these are much
harder for spammers to penetrate and so are less attractive
to them. These observations motivated us to “follow the
money” by discovering and accumulating spammer-targeted keywords, and heavily monitoring their top search
results in which the spammers must reveal their presence in
order to make money.
We currently collect spammer-targeted keywords from
five sources:
a) Anchor text associated with comment-spammed links
from spammed forums (see Section 3.2.1); these
keywords are ranked by their total number of
appearances in spammed forums [8];
b) Hyphenated keywords from spam-URL names: e.g.,
“fluorescent lighting” from
http://www.backupyourbirthcontrol.org/fluorescent-lighting.dhtml;
c) Most-bid keywords from a legitimate ads syndication
program that have been spammed;
d) Customer complaints about heavily spammed keywords;
e) Honeypot search terms that are rarely queried by regular
users but heavily squatted by spammers who are thirsty
for any kind of traffic: these include typos such as
“gucci handbg”, “cheep ticket”, etc. and unusual
combinations of keywords from keyword-blanket pages
as discussed in Section 2.
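Source (b) above can be illustrated with a minimal Python sketch. It is not the system's actual extractor: the function name `keywords_from_url` and the example.com URLs are hypothetical, and the sketch assumes that only hyphenated path segments (not host names) carry keywords.

```python
import re
from urllib.parse import urlparse


def keywords_from_url(url: str) -> list[str]:
    """Recover keyword phrases from hyphenated path segments of a URL.

    Illustrative assumption: a segment such as "cheap-air-ticket.html"
    encodes the phrase "cheap air ticket".
    """
    phrases = []
    for segment in urlparse(url).path.split("/"):
        # Drop a trailing file extension such as ".html" or ".dhtml".
        stem = re.sub(r"\.[A-Za-z0-9]+$", "", segment)
        words = [w for w in stem.split("-") if w]
        if "-" in stem and len(words) >= 2:
            phrases.append(" ".join(words).lower())
    return phrases
```

A real collector would presumably also normalize spelling variants and merge the result with the other four sources before ranking.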
that, among all web pages that eventually land on third-party final-destination domains, a high percentage of them
are spam.
To capture spammers who use a large frame to pull in
third-party content without changing the address bar, we
include a “most-visible frame” analysis that identifies the
frame with the largest size and highlights the redirection
domain that is responsible for supplying the content for that
frame. For example, the main content for this spam page
http://rme21-california-travel.blogspot.com comes from the
third-party domain results-today.com, even though the
address bar stays on blogspot.com. Our analysis is able to
identify results-today.com as the major redirection domain
associated with the most-visible frame.
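The most-visible frame analysis can be sketched as follows. This is a simplified illustration under stated assumptions: the `Frame` record, its pixel dimensions, and the function name are hypothetical, and a real implementation would obtain frame geometry from the instrumented browser.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Frame:
    src_url: str   # URL that supplied the frame's content
    width: int     # rendered width in pixels (hypothetical measurement)
    height: int    # rendered height in pixels (hypothetical measurement)


def most_visible_frame_domain(frames):
    """Return the host serving the frame with the largest rendered area."""
    if not frames:
        return None
    largest = max(frames, key=lambda f: f.width * f.height)
    return urlparse(largest.src_url).hostname
```

For a page whose address bar stays on the hosting site, the returned host is the candidate redirection domain to group on.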
Another special form of redirection-based grouping is
domain-specific client-ID grouping. This is made possible
by ads syndicators and affiliate-program providers who
embed their client-IDs as part of the redirection URLs for
accounting purposes. For example, these two spam doorway
URLs belong to the same Google AdSense customer with
ID “ca-pub-4278413812249404”:
http://mywebpage.netscape.com/todaysringtones/freenokia-ringtones/
and http://www.blogcharm.com/motorolaringtones/. The two
URLs under http://hometown.aol.com/kliktop that were
discussed in Section 2.2 and associated with the same
Expedia “affcid=41591555” were another good example.
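A minimal sketch of client-ID grouping is shown below. The two patterns mirror the ID formats quoted in the text ("ca-pub-…" and "affcid=…"); the function name, the input layout, and the example redirection URLs used to exercise it are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Patterns for the two client-ID formats mentioned in the text.
CLIENT_ID_PATTERNS = [
    re.compile(r"ca-pub-\d+"),   # syndication publisher ID
    re.compile(r"affcid=\d+"),   # affiliate ID parameter
]


def group_by_client_id(traffic):
    """Group primary URLs by client IDs found in their redirection URLs.

    traffic: dict mapping each primary URL to the list of redirection
    URLs recorded while visiting it. Only IDs shared by at least two
    primary URLs are returned.
    """
    groups = defaultdict(set)
    for primary_url, redirection_urls in traffic.items():
        for redirection_url in redirection_urls:
            for pattern in CLIENT_ID_PATTERNS:
                for client_id in pattern.findall(redirection_url):
                    groups[client_id].add(primary_url)
    return {cid: urls for cid, urls in groups.items() if len(urls) > 1}
```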
We fully expect that redirection spammers will start
obfuscating their redirection patterns once all major search
engines employ the above straightforward and effective
redirection analysis. Table 1 gives an example of how a
more sophisticated redirection-similarity analysis can be
applied to identifying spam pages that do not use third-party redirections. Although the spam URL in (a) redirects
to http://yahooooo.info/ and the one in (b) displays content
only from its own domain, the two redirection patterns
suggest that they are most likely associated with the same
spammer (or spam software) because the corresponding
GIF files all have exactly the same size.
3.1.2. SearchMonkeys: Search, Scan, and Analyze
Given a set of search terms, SearchMonkeys perform a
query for each to retrieve the top-N search results, compile
a list of unique URLs, and launch a full-fledged browser to
visit each URL (which we call “primary URL”) to ensure
that spam analysis is applied to the traffic that leads to the
true user-seen content. All resulting HTTP traffic is
recorded at the network level, including the header and
content of each request and response. Each URL is
represented by this set of traffic data, the number of its
appearances in all the search results and additional pieces
of information to be described shortly. We then perform
similarity analysis based on all the information to detect
correlations among search results that indicate likely spam
and to produce a prioritized list of groups of spam-URL
suspects for the Spam Verifier.
Based on an extensive study of hundreds of thousands of
spam pages, we found that the following five types of
similarity analyses are useful for catching large-scale
spammers:
(1) Redirection domains: Grouping based on individual
redirection domains is the simplest form of similarity
analysis: at the end of each batched scan, primary URLs are
grouped under each third-party domain they generated
redirection traffic to. Top third-party domains based on the
group size – excluding well-known legitimate ads
syndicator domains, web-analytics domains, and previously
known spammer domains – are highlighted as top suspects
for spam investigation. Many large-scale spammers create
thousands of doorway pages that redirect to a
single domain, and we have used this grouping analysis to
identify thousands of spammer redirection domains.
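The grouping step described above can be sketched in a few lines of Python. This is an illustration, not the SearchMonkeys implementation: the helper `registered_domain` crudely takes the last two host labels (a public-suffix list would be more accurate), and the function names and input layout are assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse


def registered_domain(url):
    """Crude domain extraction: last two host labels, or the raw IP address."""
    host = urlparse(url).hostname or ""
    if host.replace(".", "").isdigit():   # IP-address form, keep as-is
        return host
    return ".".join(host.split(".")[-2:])


def group_by_redirection_domain(traffic, excluded):
    """Group primary URLs under each third-party domain they sent traffic to.

    traffic:  dict mapping primary URL -> list of request URLs it triggered.
    excluded: set of domains to ignore (legitimate syndicators, analytics
              providers, already-known spammer domains).
    Returns (domain, primary URLs) pairs, largest group first.
    """
    groups = defaultdict(set)
    for primary_url, request_urls in traffic.items():
        own_domain = registered_domain(primary_url)
        for request_url in request_urls:
            domain = registered_domain(request_url)
            if domain and domain != own_domain and domain not in excluded:
                groups[domain].add(primary_url)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
```

The head of the returned list corresponds to the "top suspects" that would be handed to spam investigation.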
In most cases, one particular redirection domain – the
final-destination domain – is most useful in identifying
spammers. The final-destination domain refers to the one
that appears in the browser address bar when all redirection
activities finish. Experimental results in Section 4 will show
Table 1: A redirection spam page sharing a similar traffic
pattern with a non-redirection spam page

(a) Primary URL: http://bed-12345.tripod.com/page78book.html
Redirection URLs (under http://yahooooo.info/)              Size
images/klikvipsearchfooter.gif                              768
images/klikvipsearchsearchbarright.gif                      820
images/klikvipsearchsearchbarleft.gif                       834
images/klikvipsearchheader.gif                              2,396
images/klikvipsearchsearchbar.gif                           9,862

(b) Primary URL: http://tramadol-cheap.fryeart.org/
Redirection URLs (under http://tramadol-cheap.fryeart.org)  Size
images/crpefooter.gif                                       768
images/crpesearchbarright.gif                               820
images/crpesearchbarleft.gif                                834
images/crpeheader.gif                                       2,396
images/crpesearchbar.gif                                    9,862
(5) Click-through analysis: when spam pages are
identified as potential ads-portal pages through simple
heuristics, redirection traffic following an ad-clicking could
be useful for grouping. For example, these two ads-portal
pages look quite different:
Generalizing the analysis further, we apply the similarity
measure proposed by Sun et al. [9] to our problem: each
URL is modeled by the number of HTTP response objects
and their sizes (i.e., a multiset of response object lengths).
The similarity between two URLs is measured by
Jaccard’s coefficient [10]: the size of the multiset intersection (i.e.,
the sum over all lengths of the minimum number of repetitions) divided by the size of the
multiset union (i.e., the sum of the maximum number of repetitions) [11].
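The multiset form of Jaccard's coefficient described above can be written directly with Python's `collections.Counter`, whose `&` and `|` operators take per-element minimum and maximum counts. The function name is ours; the input is assumed to be the list of HTTP response object lengths recorded for each URL.

```python
from collections import Counter


def multiset_jaccard(sizes_a, sizes_b):
    """Jaccard's coefficient over two multisets of response object sizes."""
    a, b = Counter(sizes_a), Counter(sizes_b)
    intersection = sum((a & b).values())   # per-size minimum repetitions
    union = sum((a | b).values())          # per-size maximum repetitions
    return intersection / union if union else 0.0
```

Applied to the two size columns of Table 1, both multisets are {768, 820, 834, 2396, 9862}, so the coefficient is 1.0 even though the two pages load their images from different domains.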
(2) Final user-seen page content: this could directly come
from the content of one of the HTTP responses or it could
be dynamically generated by browser-executed scripts.
Content signatures from this page could be used for
grouping purposes. For example, the ads links shown on
these two pages share the same aggregator domain name (in
IP-address form) – 66.230.138.211:
But if you click the Orbitz ad on either page, it would
generate a redirection chain that includes the same
syndicator domain looksmart.com.
3.1.3. Spam Verifier
For each group of spam-URL suspects produced by the
SearchMonkeys, the spam verifier extracts a sample set,
performs seven types of analyses on each URL to gather
evidence of spamming behavior, and computes an average
spam score for the group. Groups with scores above a
threshold are classified as confirmed spam. As shown in
Figure 1, confirmed spam URLs are submitted to the
spammed-forum hunter and the keyword collector, while
confirmed spammer redirection domains are added to the
known-bad signature used by both the targeted patrol &
hunting component and SearchMonkeys. The remaining
unconfirmed groups are submitted to human judges for
manual investigation. The seven types of spam analyses are:
http://free-porn.IEEEPCS.org/
http://HistMed.org/free-porn.phtml
(3) Domain WhoIs and IP address information: for many of the
largest-scale spammers, IP address- or subnet-based
grouping is most effective; for example, these two different-looking blog pages – http://urch.ogymy.info/ and
http://anel.banakup.org/ – are hosted on the same IP address
72.232.234.154. Similarly, it is not uncommon to see
multiple redirection domains share the same IP
address/subnet or WhoIs registrant name. For example,
these two redirection domains share the same IP address
216.255.181.107 but are registered under different names
(“→” means “redirects to”):
1. whether the page redirects to a new domain residing on
an IP address known to host at least one known spammer
redirection domain;
2. whether its link-query results contain a large number of
forums, guest books, message boards, etc. that are known to
be popular among comment spammers;
3. whether the page uses scripting-on/off cloaking to
achieve crawler-browser cloaking, by comparing two
vectors of redirection domains from visiting the same page
twice – once with regular browser settings and the second
time with browser scripting turned off;
4. whether the page uses click-through cloaking [6], by
comparing the two redirection vectors from a search-result
click-through visit and a direct visit, respectively;
5. whether the page is an ads-portal page that forwards ads
click-through traffic to known spam-traffic aggregator
domains;
6. whether the page is hosted on a known spam-heavy
domain;
7. whether a set of images or scripts match known-spam
filenames, sizes, or content hashes.
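The group-level decision made by the spam verifier can be sketched as below. The text specifies only that a sample is drawn, analyses are run per URL, and an average score is compared with a threshold; the per-URL scoring rule (fraction of analyses that fire), the sample size, and the threshold value are all assumptions made for this illustration.

```python
import random
from statistics import mean


def url_spam_score(url, analyses):
    """Fraction of analyses that flag the URL (assumed scoring rule)."""
    return sum(1 for check in analyses if check(url)) / len(analyses)


def verify_group(urls, analyses, sample_size=10, threshold=0.3, rng=None):
    """Score a group of spam-URL suspects from a sample of its members.

    analyses: list of callables, each taking a URL and returning True
              when it finds evidence of spamming behavior.
    Returns (average score, confirmed flag); unconfirmed groups would go
    to human judges.
    """
    rng = rng or random.Random(0)
    sample = urls if len(urls) <= sample_size else rng.sample(urls, sample_size)
    score = mean(url_spam_score(u, analyses) for u in sample)
    return score, score > threshold
```

Each of the seven analyses listed above would be supplied as one callable, for example a function that compares the redirection-domain vectors from a scripting-on visit and a scripting-off visit.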
In addition, the spam verifier produces the following
three ranked lists to help prioritize targeted self-protection:
(1) domains ranked by the number of unique confirmed
spam URLs they host; (2) domains ranked by their number
of spam appearances in the search results; (3) domains
ranked by their spam percentages – number of unique spam
• http://hometown.aol.com/seostore/discount-coach-purse.html ² → topsearch10.net
• http://www.beepworld.de/memberdateien/members101/satin7374/buy-propecia.html → drugse.com
As another example, these two redirection domains share
the same registrant name, but are hosted on two different IP
addresses:
• http://9uvarletotr.proboards51.com/ → paysefeed.net
• http://ehome.compuserve.de/copalace15/tracfoneringtones.html → arearate.com
(4) Link-query results: these provide two possibilities for
similarity grouping: (1) link-query result similarity: for
example, the top-10 link-query results for
http://ritalin.r8.org/ and http://ritalin.ne1.net/ overlap 100%
and have a 50% overlap with those for the beepworld.de
spam URL mentioned above. This suggests that they are
very likely comment-spammed by the same spammer; (2)
when a pair of spam URLs appear in each other’s link-query results, it is a strong indication that they belong to the
same link farm.
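Both link-query signals can be expressed compactly. In this sketch the overlap is defined as the number of shared results divided by the size of the smaller list, which is our assumption (the text reports overlap percentages without defining the denominator); the function names and the `link_results` layout are also hypothetical.

```python
def overlap_fraction(results_a, results_b):
    """Share of link-query results two URLs have in common.

    Assumed definition: shared results divided by the smaller list size.
    """
    a, b = set(results_a), set(results_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def mutually_linked(url_a, url_b, link_results):
    """True when each URL appears in the other's link-query results.

    link_results: dict mapping a URL to the pages returned by its
    link-query, i.e. pages that link to it.
    """
    return (url_b in link_results.get(url_a, ())
            and url_a in link_results.get(url_b, ()))
```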
http://cheap-air-ticketoo.blogspot.com/
http://hometown.aol.com/discountz4you/cheap-hotelticket.html
2 This spam URL uses click-through cloaking [6]. The redirection
to topsearch10.net can only be seen by clicking through a search
result.
URLs divided by number of all unique URLs that appear in
the search results.
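The three ranked lists can be computed from the scanned search results and the set of confirmed spam URLs. The sketch below is illustrative: it treats the URL's host name as its "domain" and uses names of our own choosing.

```python
from collections import defaultdict
from urllib.parse import urlparse


def rank_domains(appearances, spam_urls):
    """Build the three spam-heavy rankings.

    appearances: list of URLs, one entry per search-result appearance
                 (duplicates allowed).
    spam_urls:   set of confirmed spam URLs.
    """
    unique_all = defaultdict(set)
    unique_spam = defaultdict(set)
    spam_appearances = defaultdict(int)
    for url in appearances:
        host = urlparse(url).hostname
        unique_all[host].add(url)
        if url in spam_urls:
            unique_spam[host].add(url)
            spam_appearances[host] += 1
    percentage = {h: len(unique_spam[h]) / len(unique_all[h]) for h in unique_all}
    return {
        "by_unique_spam": sorted(unique_spam, key=lambda h: len(unique_spam[h]), reverse=True),
        "by_spam_appearances": sorted(spam_appearances, key=spam_appearances.get, reverse=True),
        "by_spam_percentage": sorted(percentage, key=percentage.get, reverse=True),
        "percentage": percentage,
    }
```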
The observation that a large-scale spammer has
successfully spammed the keywords that SearchMonkeys
are monitoring is often an indication that they are
systematically attacking a weakness in the ranking
algorithms and most likely have penetrated many other
unmonitored keywords as well. The targeted patrol &
hunting component is responsible for hunting down as
many spam pages belonging to the same spammer as
possible so that damages can be controlled in the short term
while longer-term solutions that involve ranking algorithm
modifications are being developed. Since the unmonitored
set is much larger than the monitored set, this component
relies on the following “spam-heavy lists” to prioritize the
use of machine resources:
• Targeted patrol of search results hosted on spam-heavy domains: this component addresses the following
practical issue: suppose the self-monitoring part can only
afford to cover N search terms and we have approximately
the same number of machines for the self-protection part
but we would like to clean up, say, 10xN search terms. How
should we prioritize the search results to scan? Our current
approach is to use the lists of top spam-heavy domains and
top spam-percentage domains produced by the spam
verifier to filter the larger set of search results for targeted
patrol. See Section 4.2.1 for an evaluation of this approach.
• Targeted hunting of spam suspects from spammed
forums: once a set of spammed forums is identified as the
culprits for successfully promoting some spam URLs, the
other URLs that appear in the same forums are very likely
enjoying the same bogus link support and have a better
chance of getting into top search results. See Section 4.2.2
for a case study of effective spam hunting through targeted
scanning of such spam suspect lists.
• Malicious spam URLs: we have observed that some
malicious website operators that exploit browser
vulnerabilities to install malware are also using search spam
techniques to get access to more client machines. The
Strider Search Ranger system is connected to the Strider
HoneyMonkey system, which uses virtual machine-based,
active client-side honeypots to detect malicious websites
[12]. Spam URLs detected by Search Ranger are submitted
to HoneyMonkey to check for malicious activities. Once a
spam URL is determined to be malicious, it is immediately
removed from the index to protect search users, and spam
suspects from its link-query results are scanned with high
priority. Section 4.2.3 describes an actual example of a
malicious URL occupying a large number of top search
results at a major search engine.
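The filtering step of targeted patrol can be illustrated as follows. This is a minimal sketch assuming that the spam-heavy list is a set of host suffixes and that a result matches when its host equals, or is a subdomain of, a listed entry; the function name is hypothetical.

```python
from urllib.parse import urlparse


def select_for_patrol(search_results, spam_heavy_domains):
    """Keep only search-result URLs hosted on listed spam-heavy domains."""
    selected = []
    for url in search_results:
        host = urlparse(url).hostname or ""
        if any(host == d or host.endswith("." + d) for d in spam_heavy_domains):
            selected.append(url)
    return selected
```

Only the selected URLs would then be scanned, which is how a fixed pool of machines could cover a larger set of search terms than the self-monitoring subsystem does.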
3.2. Self-Protection Subsystem
Taking spam analysis results from the self-monitoring
subsystem, the self-protection subsystem is responsible for
taking actions to defend the search engine against detected
spammers, including a long-term solution to strengthen the
ranking algorithms and a short-term solution to clean up
existing damages.
3.2.1. Spammed-Forum Hunter
Taking confirmed spam URLs as input, the spammed-forum hunter performs a link-query for each URL to collect
spammed forums that have successfully promoted the spam
URL into top search results. It extracts all third-party URLs
that appear in such forums, and feeds them to the targeted
patrol & hunting component. It also extracts all their
associated anchor text (which consists of the keywords that
the spammers want search engines to index and are usually
the search terms they are targeting) and feeds them to the
keyword collector.
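The extraction step performed on each spammed forum can be sketched with the standard-library HTML parser. This is an illustration under assumptions: the class and function names are ours, "third-party" is taken to mean any absolute link whose host differs from the forum's host, and the forum HTML is assumed to have been fetched already.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def third_party_links(forum_url, html):
    """Return (URL, anchor text) pairs that point off the forum's host."""
    forum_host = urlparse(forum_url).hostname
    parser = LinkExtractor()
    parser.feed(html)
    return [(href, text) for href, text in parser.links
            if urlparse(href).hostname not in (None, forum_host)]
```

The URLs would feed the targeted patrol & hunting component, and the anchor texts would feed the keyword collector.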
3.2.2. Strengthening Relevance Ranking Algorithms
In order for a spam link to appear in top search results,
the relevance ranking algorithms must have made a mistake
in their content- and link-based components. In other words,
the spammers must have successfully reverse-engineered
the algorithms and crafted their pages and links in a way
that fooled the algorithms.
To help identify the weaknesses exploited by the
spammers, three pieces of information from spam analysis
(see the three gray circles in Figure 1) are provided to the
ranking algorithms:
• Confirmed spam URLs, spammer-targeted keywords,
and spam-heavy domains are provided to the link-based
ranking algorithms to better train the classifier that is
responsible for identifying web pages whose outgoing
links should be discounted. In particular, web forums
that contain multiple spammer-targeted keywords across
unrelated categories (such as drugs, handbags, and mp3)
and multiple URLs from different spam-heavy domains
are very likely spammed forums that should be flagged.
When many spammed forums share an uncommon URL sub-string, it is often an indication that a new type of forum
has been discovered by spammers.
• Confirmed spam URLs are also provided to the content-based ranking algorithms. If they are cloaked pages,
their indexed fake pages should be excluded from the
ranking to avoid “contaminating” the relevance
evaluation of good pages; if they are not cloaked pages,
their content should be analyzed for new content-spamming tricks. In either case, the spam pages can be
used to extract more spammer-targeted keywords to feed
the SearchMonkeys.
4. Experimental Evaluations
In this section, we present results from experiments that
focused on individual subsystems and subsets of keywords
to better illustrate the effectiveness of our approach. We use
a list of spammer-targeted keywords that we previously
constructed [8] for most of the evaluations presented in this
3.2.3. Targeted Patrol & Hunting Component
section. Starting with a list of 4,803 confirmed spam URLs,
we used link-queries to obtain 35,878 spammed forums,
from which we collected 1,132,099 unique anchor-text
keywords with a total of 6,026,699 occurrences. We then
ranked the keywords by their occurrence counts to produce
a sorted list. To minimize human investigation effort, we
focused on only spam pages that redirected to third-party
final destinations in all the experiments. Unless mentioned
otherwise, we obtained the top-20 results for each search
query.
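The keyword-ranking step described above (counting anchor-text occurrences and sorting by count) can be sketched as follows. The lowercasing and whitespace trimming are assumptions; the paper does not state how anchor text was normalized.

```python
from collections import Counter

def rank_keywords(anchor_texts):
    """Rank anchor-text keywords by occurrence count, most frequent first."""
    # Normalization (strip + lowercase) is an assumption for illustration.
    counts = Counter(t.strip().lower() for t in anchor_texts if t.strip())
    return counts.most_common()

ranked = rank_keywords(["Cheap Handbag", "cheap handbag ", "free mp3"])
```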
4.1. Evaluations of Self-Monitoring Subsystem
We present data from three experiments to illustrate the
practical use of the self-monitoring subsystem: Section
4.1.1 presents basic final destination-based grouping for
identifying large-scale spammers who successfully
spammed Google and Yahoo; Section 4.1.2 presents a
similar analysis for the typo version to evaluate the use of
search-term typos as honeypots; Section 4.1.3 presents
grouping analyses based on intermediate redirection
domains and IP addresses for catching traffic-affiliate
spammers who successfully spammed Live Search.
4.1.1. Final-Destination and Doorway Analysis
For our first experiment, we selected the top-100
handbag-related keywords from the spammer-targeted list
and used SearchMonkeys to scan and analyze search results
from both Google and Yahoo. Table 2 compares the
percentage of unique URLs among all URLs (row [a]), the
percentage among unique URLs that landed on third-party
final destinations (row [b]), and the percentage of those
URLs that were determined as spam (row [c]).
Table 3 summarizes the redirection analyses for all
detected spam URLs, where the first column is the list of
final-destination domains sorted by the number of
search-result appearances that redirected to them (the 4th column)
and the second column contains the list of doorway
domains for each final-destination domain, sorted by the
number of appearances (the 3rd column).

Table 2: Top-100 Handbag Keywords (3rd means “landing
on third-party final destinations”)

                    Google      Yahoo       Google    Yahoo
                    non-typos   non-typos   typos     typos
[a] # uniq/all      55%         79%         39%       50%
[b] % uniq 3rd      1.8%        7.1%        47%       50%
[c] % spam 3rd      80%         85%         99%       79%

Table 3: Redirection Analyses of Top-100 Handbag
Keywords (Non-Typo)

(a) Google results

Final-destination       Doorway domain          # app.      # app.
domain                                          / # uniq    / # uniq
#1: lyarva.com          forumfactory.com        21 / 1      44 (2.2%) / 4
                        onlinewebservice6.de    20 / 1
                        foren.cx                2 / 1
                        page.tl                 1 / 1
#2: biopharmasite.info  blogigo.de              32 / 1      32 (1.6%) / 1
                        (malicious URL: blogigo.de/handbag)
#3: topsearch10.com     forumfactory.com        8 / 1       18 (0.9%) / 8
                        blogspot.com            6 / 6
                        kostenloses-forum.be    4 / 1
topmeds10.com           blogspot.com            1 / 1       1
kikos.info              blogspot.com            1 / 1       1
hachiksearch.com        asv-basketball.org      1 / 1       1
Total                   *                       *           97 (4.9%) / 16

(b) Yahoo results

Final-destination       Doorway domain          # app.      # app.
domain                                          / # uniq    / # uniq
#1: shajoo.com          onlinehome.us           16 / 16     50 (2.5%) / 50
                        freett.com              15 / 15
                        eccentrix.com           5 / 5
                        lacomunitat.net         4 / 4
                        5 misc. domains         10 / 10
#2: findover.org        hometown.aol.com        30 / 17     30 (1.5%) / 17
#3: topsearch10.com     geocities.com           2 / 2       10 (0.5%) / 7
                        blogspot.com            4 / 1
                        hometown.aol.com        1 / 1
                        3 misc. domains         3 / 3
maximumsearch.net       4 misc. domains         9 / 4       9 / 4
myqpu.com               freett.com              5 / 5       5 / 5
nashtop.info            toplog.nl               4 / 4       4 / 4
fast-info.org           mywebpage.netscape.com  4 / 4       4 / 4
searchadv.com           2 misc. domains         2 / 2       2 / 2
results-today.com       xoomer.virgilio.it      1 / 1       1 / 1
filldirect.com          opooch.com              1 / 1       1 / 1
Total                   *                       *           116 (5.8%) / 95

First, we observe several similarities between the Google
and Yahoo data. Table 2 shows that both of them had only a
small percentage of unique search-result URLs (1.8% and
7.1%) that redirected to third-party final destinations, and a
large percentage of those URLs (80% and 85%) were
determined to be spam. Table 3 shows that both of them
had a non-trivial percentage of spam results (4.9% and
5.8% as shown in the last row) that could be detected by
Search Ranger and both were spammed by large-scale
spammers: the top-3 final-destination domains alone were
responsible for 4.7% and 4.5% spam densities.
The two sets of data also exhibit significant differences.
Overall, Google had a lower percentage of unique URLs
(55% versus 79%) and its spam URLs followed a similar
pattern: they had an average appearance count of 97 / 16 =
6.1, which is much higher than Yahoo’s number of 116 / 95
= 1.2. More significantly, its top-3 spam URLs had 32, 21,
and 20 appearances, respectively, which are much higher
than Yahoo’s maximum per-URL appearances of four. The
two lists of final-destination domains and doorway
domains also differ significantly (except for the most
ubiquitous topsearch10.com and blogspot.com). This
demonstrates the importance of search engine-specific
self-monitoring to detect new large-scale spammers who have
started to defeat its current anti-spam solution so that
proper, engine-specific defense mechanisms can be
developed and deployed at an earlier stage to prevent
large-scale damage to search quality.
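The final-destination grouping behind Table 3 can be sketched in a few lines of Python. This is an illustrative reconstruction, not the Search Ranger implementation: the record layout (one tuple per spam search-result appearance) is an assumption, and the density denominator of 2,000 results (100 keywords times top-20) is inferred from the reported percentages.

```python
from collections import defaultdict

def group_by_final_destination(results, total_results):
    """Group spam appearances by final-destination and doorway domain.

    results: iterable of (doorway_url, doorway_domain, final_domain),
             one tuple per search-result appearance (assumed layout).
    Returns rows of (final_domain, {doorway: (#app, #uniq)}, #app, #uniq,
    spam density in percent), sorted by #app in descending order.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for url, doorway, final in results:
        groups[final][doorway].append(url)
    table = []
    for final, doorways in groups.items():
        rows = {d: (len(urls), len(set(urls))) for d, urls in doorways.items()}
        apps = sum(a for a, _ in rows.values())
        uniq = sum(u for _, u in rows.values())
        table.append((final, rows, apps, uniq, 100.0 * apps / total_results))
    table.sort(key=lambda r: r[2], reverse=True)
    return table

# Usage with a few appearance counts from the Google data in Table 3;
# the forum URLs "u1" and "u2" are placeholders.
sample = ([("http://blogigo.de/handbag", "blogigo.de", "biopharmasite.info")] * 32
          + [("u1", "forumfactory.com", "lyarva.com")] * 21
          + [("u2", "onlinewebservice6.de", "lyarva.com")] * 20)
table = group_by_final_destination(sample, total_results=2000)
```

A high appearance count with a low unique count, as in the Google data, points to a few spam URLs that rank for many keywords.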
4.1.2. Search Term Typo-Patrol
To study the effect of search-term typos on redirection
analysis, we replaced the word “handbag” with “handbg”
in the list of keywords used in Section 4.1.1 and rescanned
the Google and Yahoo search results. The numbers are
summarized in the last two columns of Table 2. Compared
to their non-typo counterparts, both sets had a significant
drop in the percentage of unique URLs and an even more
significant increase in the percentage of URLs that landed
on third-party final destinations, among which the spam
densities remain very high. This confirms the use of
search-term typos as effective honeypots to obtain search results
with a high spam density.
Among the 357 unique spam URLs in the Google data,
the lesser-known blog site alkablog.com was the top
doorway domain responsible for 277 (78%) of the spam
URLs, followed by the distant #2 kaosblog.com (15) and #3
creablog.com (13), ahead of the #4 blogspot.com (12). This
provides good intelligence on new doorway domains that
are becoming spam-heavy and should be scrutinized.
4.1.3. Traffic-Affiliate Analysis
To detect the keyword-blanket spammers mentioned in
Section 2.2, we manually³ constructed 10 search terms
using unusual combinations of keywords from a keyword
blanket page, for example, “George Bush mset machine
learning”. We also received another three search terms
through internal customer complaints. We issued these 13
queries at Live Search and retrieved all search results that
the engine was willing to return. In total, we obtained 6,300
unique URLs, among which 1,115 (18%) landed on the two
largest adult affiliate program providers: 905 to
hornymatches.com and 210 to adultfriendfinder.com.
Among the 1,115 spam URLs, Table 4 shows the
intermediate rotators and the number of doorways
associated with each. Domains #1, #4, and #5 were
previously unknown spammer redirection domains at that
time and have since been added to the signature set. Some
of these spammers had successfully spammed Yahoo search
results as well. Table 5 shows an alternative grouping
analysis based on the IP addresses of doorways, which
revealed two additional pieces of information: the close
relationship between chippie.biz and tabaiba.biz, and the
spam-heavy subnet that contained multiple IP addresses
redirecting to w3gay.com.
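The two groupings used in the traffic-affiliate analysis of Section 4.1.3 (by intermediate rotator and by doorway IP address) can be sketched as below. The record layout is hypothetical, and treating a subnet as a /24 prefix is an assumption; the paper does not say how subnets were delimited.

```python
import ipaddress
from collections import Counter

def doorways_per_rotator(records):
    """Count doorways per rotator host.

    records: iterable of (doorway_url, rotator_host, doorway_ip);
             this layout is assumed for illustration.
    """
    return Counter(rotator for _doorway, rotator, _ip in records)

def doorways_per_subnet(records, prefix_len=24):
    """Count doorways per IP subnet (the /24 default is an assumption)."""
    counts = Counter()
    for _doorway, _rotator, ip in records:
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(net)] += 1
    return counts

# Placeholder doorway names with rotators and IP addresses of the kind
# listed in Tables 4 and 5.
records = [
    ("d1", "borg.w3gay.com", "70.86.247.37"),
    ("d2", "borg.w3gay.com", "70.86.247.38"),
    ("d3", "rotator.wukku.com", "207.44.234.82"),
]
by_rotator = doorways_per_rotator(records)
by_subnet = doorways_per_subnet(records)
```

Grouping by subnet rather than by exact address is what lets neighbouring addresses that serve the same rotator show up as one spam-heavy block.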
Table 4: Adult traffic affiliates and doorways

      Adult rotators                              # doorways
#1    rotator.leckilecki.com/Esc.php              686
#2    borg.w3gay.com/                             206
#3    www.rotator.chippie.biz/My_Rotator.php      90
#4    www.rotator.tabaiba.biz/My_Rotator.php      80
#5    www.rotator.pulpito.biz/My_Rotator.php      40
#6    rotator.siam-data.com/Esc.php               10
#7    rotator.wukku.com/Esc.php                   3
Total                                             1,115

4.2. Evaluations of Self-Protection Subsystem
We present data from three experiments to illustrate the
actual use of the self-protection subsystem in practice:
Section 4.2.1 evaluates the effectiveness of targeted patrol
of spam-heavy domains; Section 4.2.2 presents a case study
of link-query spam hunting; Section 4.2.3 describes an
example of malicious-URL spam hunting.
4.2.1. Targeted Patrol of Search Results
To evaluate the effectiveness of targeted patrol of spam-heavy domains, we first scanned and analyzed the top-20
Live Search results of the top-1000 keywords from the
spammer-targeted list, and derived a total of 89 spam-heavy
domains by taking the union of the top-10 domains in terms
of the number of hosted spam URLs and number of spam
appearances, and all domains that had a higher than 50%
spam percentage (see Section 3.1.3).
We then performed a “horizontal” targeted patrol of the
top-20 results for the next 10,000 spammer-targeted
keywords, and a “vertical” targeted patrol of the 21st to 40th
search results for the entire 11,000 keywords. The latter is
to minimize the chance that vacated top-20 positions are
filled with spam again.
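The derivation of the spam-heavy domain list described above (the union of two top-10 lists and all domains above a 50% spam percentage) can be sketched as follows. The per-domain statistics layout is hypothetical, and spam percentage is assumed here to mean spam URLs divided by all scanned URLs on the domain; Section 3.1.3 gives the authoritative definition.

```python
def spam_heavy_domains(stats, top_n=10, min_spam_ratio=0.5):
    """Select spam-heavy domains as a union of three criteria.

    stats: {domain: (num_spam_urls, num_spam_appearances, num_all_urls)}
           (assumed layout for illustration).
    """
    by_urls = sorted(stats, key=lambda d: stats[d][0], reverse=True)[:top_n]
    by_apps = sorted(stats, key=lambda d: stats[d][1], reverse=True)[:top_n]
    by_ratio = [d for d, (spam, _apps, total) in stats.items()
                if total and spam / total > min_spam_ratio]
    return set(by_urls) | set(by_apps) | set(by_ratio)

# Hypothetical statistics; top_n=1 keeps the example small.
stats = {
    "a.example": (10, 12, 100),
    "b.example": (3, 40, 50),
    "c.example": (2, 2, 3),
    "d.example": (1, 1, 100),
}
selected = spam_heavy_domains(stats, top_n=1)
```

Taking a union means a domain qualifies through any one criterion, which is why the list can hold more than 20 domains.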
Table 5: Adult traffic affiliates and IP addresses

IP address        # doorways   Rotator association
209.85.15.38      686          leckilecki.com
207.44.142.129    170          chippie.biz (90) & tabaiba.biz (80)
70.86.247.37      145          w3gay.com (part 1 of 206)
70.86.247.38      36           w3gay.com (part 2 of 206)
70.86.247.34      25           w3gay.com (part 3 of 206)
74.52.19.162      40           pulpito.biz
69.93.222.98      10           siam-data.com
207.44.234.82     3            wukku.com
Total             1,115        *

³ We are currently working on automating the keyword
construction process by searching, scanning and analyzing
random combinations or applying information retrieval and
natural language techniques to pick search terms that have very
few good matches.
Table 6 shows that the 89-domain filter selected
approximately 10% (9.8% and 9.0%) of the URLs for
targeted patrol. Among them, a high percentage (68% and
65%) redirected to third-party final destinations, which
indicates that this is most likely a spam-heavy list. Search
Ranger analysis confirmed that more than 3 out of every 4
URLs (78% and 76%) on that list were spam. It also
confirmed that targeted patrol of spam-heavy domains is
productive: approximately half (53% and 49%) of the
selected URLs based on spam-heavy domains were spam.
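The percentages quoted for Table 6 can be re-derived from the raw counts. The short check below uses only the counts reported in Table 6; the dictionary layout and function names are illustrative.

```python
# Counts copied from Table 6; the layout is only for illustration.
TABLE6 = {
    "top-20 of next 10,000 keywords":
        {"a": 141_442, "b": 13_846, "c": 9_395, "d": 7_339},
    "next 20 of all 11,000 keywords":
        {"a": 165_448, "b": 14_867, "c": 9_649, "d": 7_345},
}

def percent(part, whole):
    return 100.0 * part / whole

def ratios(row):
    """Return the four percentages reported alongside each column."""
    return {
        "[b] of [a]": percent(row["b"], row["a"]),  # on spam-heavy domains
        "[c] of [b]": percent(row["c"], row["b"]),  # third-party redirects
        "[d] of [b]": percent(row["d"], row["b"]),  # spam among selected
        "[d] of [c]": percent(row["d"], row["c"]),  # spam among redirects
    }

for name, row in TABLE6.items():
    print(name, {k: round(v, 1) for k, v in ratios(row).items()})
```

The "[d] of [b]" ratio is the hit rate of targeted patrol: the share of URLs selected by the domain filter that turned out to be spam.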
Table 6: Horizontal and vertical targeted patrol

                          Top-20 results of       Next 20 results of
                          next 10,000 keywords    all 11,000 keywords
[a] # unique URLs         141,442                 165,448
[b] # unique URLs on      13,846                  14,867
    spam-heavy domains    9.8% of [a]             9.0% of [a]
[c] # unique 3rd          9,395                   9,649
                          68% of [b]              65% of [b]
[d] # spam URLs           7,339                   7,345
                          53% of [b]              49% of [b]
                          78% of [c]              76% of [c]

4.2.2. Targeted Hunting of Related Spam
For the second experiment, we selected the top-100
travel-related search terms from the spammer-targeted list,
and scanned the search results from Live Search. Among
the 105 confirmed spam URLs, the top redirection domain
searchlab.info was behind 30 doorway URLs – 10 were
hosted on geocities.com and 20 on angelfire.com. We used
these 30 URLs as seeds to evaluate the cross-category
impact of targeted hunting.
Link-query results for the 10 geocities.com URLs
produced a list of 1,943 spam suspects hosted on the same
domain. We extracted 631 URLs that contained at least one
of the following four keywords observed to be popular on
the list: “parts”, “handbag”, “watch”, and “replica”.
With a few exceptions of no-longer-valid URLs, all of the
631 URLs redirected to searchlab.info. We then derived
630 unique search keywords from these URL names and
queried Live Search for top-20 results. We found that 138
URLs had a total of 161 appearances in the top-20 results
of 142 unique search terms. In particular, 50 appearances
were among the top-3 search results for queries like
“expensive watches”, “free webcams to watch”,
“volkswagon parts”, “designer handbag replicas”, etc.
The middle column of Table 7 summarizes the results.
We conducted a similar experiment on the 20
angelfire.com spam URLs except that this time we filtered
out spam suspects whose URL names contain travel-related
keywords or any of the previous four keywords. We
collected a total of 1,710 angelfire.com spam suspects, from
which we derived 1,598 unique search terms in diverse
categories such as “free maid of honor speeches”, “daybed
mattress discount”, “cheap viagra uk”, “free pamela
anderson video clip”, etc. Again, searchlab.info was behind
almost all of the still-active spam URLs. The right-most
column of Table 7 summarizes the results: 134 unique
URLs had 229 spam appearances in the top-20 search
results of 172 keywords, including 102 top-3 appearances.
These two sets of data confirm that, if a large-scale
spammer is observed to have successfully spammed a set of
keywords at a search engine, it is very likely that it has also
spammed other keywords and link-query spam hunting is an
effective way to discover those spam URLs.

Table 7: Link-query Spam Hunt Cross-category Effect

                           Geocities.com   Angelfire.com
# suspect URLs             631             1,710
# keywords                 630             1,598
# unique URLs in top-20    138             134
# top-20 appearances       161             229
# top-3 appearances        50              102
# spammed keywords         142             172

4.2.3. Malicious-Spam Hunting
HoneyMonkey scanning of the 16 Google spam URLs
detected in Section 4.1.1 revealed that the one with the
highest number of per-URL appearances (32) was actually a
malicious URL: http:// blogigo.de / handbag (spaces added
for safety). By performing 100 queries of “site:blogigo.de”
in conjunction with the 100 handbag keywords, we obtained
27 URLs for HoneyMonkey scanning and identified an
additional malicious URL: http:// blogigo.de / handbag /
handbag / 1. Through a link-query spam hunt, we discovered
14 blogigo.de spam suspects and identified another
malicious URL: http:// blogigo.de / pain killer, which
appeared as the #7 Google search result for “pain killer”.
This example demonstrates that, when a search engine
makes a mistake and assigns a high ranking to a malicious
spam URL, the damage to search-result quality and the
potential damage to visitors’ machines could be
widespread. It is therefore extremely important for major search
engines to monitor for malicious search results and remove
them as early as possible to protect search users.
5. Related Work
Gyongyi and Garcia-Molina identified cloaking and
redirection as two techniques for hiding spam content [2].
Wu and Davison proposed an automated method to detect
semantic cloaking, which first identifies suspect pages by
the content of the pages returned to a browser and a
crawler, and then uses machine learning to create a
classifier [13]. Chellapilla and Chickering investigated
cloaking from an economic perspective by comparing
search results from the top 5,000 queries and the top 5,000
monetizable queries [4]. Benczur et al. presented SpamRank
as an automated spam detection technique by identifying
pages that violated the power law distribution by linking to
one another [14]. They observed that link similarity
measures could be more effective than trust/distrust
measures in classifying spam pages. Similarly, Carvalho et
al. focused on identifying “noisy” links, which are sites with
abnormal support between each other, by measuring the
amount of linking between two sites [15]. Many have
proposed content-based spam detection techniques [16,17].
Our Search Ranger system is unique in that it makes heavy
use of redirection-based spam detection in an autonomic
anti-spam approach.
6. Summary
Search spam will always remain an arms race as
spammers continuously try to reverse-engineer search
engines’ ranking algorithms to find weaknesses that they
can exploit. By targeting the two invariants that characterize
large-scale spammers – they care only about commerce
queries that they can monetize and they want to spam many
search results of those queries – we hope to give search
engines the upper hand by limiting spammers’ success. We
have described the general concept of using
similarity-based grouping to monitor search results of
spammer-targeted keywords, and demonstrated the concept by using
the current implementation of the Strider Search Ranger
system to defend against large-scale redirection spammers,
who have successfully spammed all three major search
engines.
We have shown through experimental evaluations that a
straightforward redirection domain-based grouping is
effective today in identifying new spammers, especially
those that redirect their spam pages to third-party final
destinations. Two sets of data from Table 2 and Table 6
show consistent results: spam-heavy lists of URLs tend to
have a high percentage (47%~68%) that land on third-party
final destinations and, among those URLs, a large
percentage (76%~99%) are spam. We have also provided
evidence that more sophisticated grouping techniques
should be able to detect tomorrow’s spammers, under the
assumption that any automated and systematic, large-scale
spamming activities must leave traces of patterns that can
be discovered.
Our reactive, last-line-of-defense approach complements
the proactive, first-line-of-defense relevance ranking
algorithms by providing concrete and systematic evidence
on the weakness of the algorithms that have been exploited
by spammers. The self-protection part of our approach also
includes a targeted patrol & hunting component that aims at
increasing spammers’ costs by wiping out their investment
at the first sign of success of some of their spam. We have
shown through experimental evaluations that such targeted
actions are often productive and can have immediate
positive impact on search quality. In particular, Table 6
showed that targeted scan-candidate selection based on
spam-heavy domains produced a list of suspects with a high
hit rate (49%~53%) and Table 7 showed that spam hunting
targeting a particular spammer allowed effective clean-up
across a broad range of keywords.
References
[1] L. Page, S. Brin, R. Motwani, and T. Winograd, “The
PageRank Citation Ranking: Bringing Order to the Web,”
http://dbpubs.stanford.edu:8090/pub/1999-66, Jan. 29, 1998.
[2] Z. Gyongyi and H. Garcia-Molina, “Web Spam Taxonomy,”
in Proc. International Workshop on Adversarial Information
Retrieval on the Web (AIRWeb), 2005.
[3] B. Wu and B. D. Davison, “Cloaking and Redirection: A
Preliminary Study,” in Proc. AIRWeb, 2005.
[4] K. Chellapilla and D. M. Chickering, “Improving Cloaking
Detection Using Search Query Popularity and Monetizability,”
in Proc. AIRWeb, August 2006.
[5] Spam Attack by Website Clones,
http://research.microsoft.com/SearchRanger/Spam_Attack_by_Website_Clones.htm.
[6] Y. Niu, Y. M. Wang, H. Chen, M. Ma, and F. Hsu, “A
Quantitative Study of Forum Spamming Using Context-based
Analysis,” in Proc. Network and Distributed System Security
(NDSS) Symposium, 2007.
[7] S. R. White, J. E. Hanson, I. Whalley, D. M. Chess, and J. O.
Kephart, “An Architectural Approach to Autonomic
Computing,” in Proc. ICAC, May 2004.
[8] Y. M. Wang, M. Ma, Y. Niu, and H. Chen, “Spam
Double-Funnel: Connecting Web Spammers with Advertisers,” to
appear in Proc. International World Wide Web (WWW)
Conference, May 2007.
[9] Q. Sun, D. R. Simon, Y. M. Wang, W. Russell, V. N.
Padmanabhan, and L. Qiu, “Statistical Identification of
Encrypted Web Browsing Traffic,” in Proc. IEEE Symp. on
Security & Privacy, May 2002.
[10] C. J. van Rijsbergen, Information Retrieval, 2nd ed.,
Butterworths, 1979.
[11] T. H. Haveliwala, A. Gionis, and P. Indyk, “Scalable
Techniques for Clustering the Web,” in WebDB (Informal
Proceedings), pp. 129-134, 2000.
[12] Y. M. Wang, D. Beck, X. Jiang, R. Roussev, C. Verbowski,
S. Chen, and S. King, “Automated Web Patrol with Strider
HoneyMonkeys: Finding Web Sites That Exploit Browser
Vulnerabilities,” in Proc. NDSS, February 2006.
[13] B. Wu and B. D. Davison, “Detecting Semantic Cloaking on
the Web,” in Proc. WWW Conference, May 2006.
[14] A. Benczur, K. Csalogany, T. Sarlos, and M. Uher,
“SpamRank – Fully Automatic Link Spam Detection,” in Proc.
AIRWeb, May 2005.
[15] A. L. da Costa Carvalho, P. A. Chirita, E. S. de Moura, P.
Calado, and W. Nejdl, “Site Level Noise Removal for Search
Engines,” in Proc. WWW Conference, May 2006.
[16] P. Kolari, T. Finin, and A. Joshi, “SVMs for the
Blogosphere: Blog Identification and Splog Detection,” in
AAAI Spring Symposium on Computational Approaches to
Analysing Weblogs, March 2006.
[17] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly,
“Detecting Spam Web Pages through Content Analysis,” in
Proc. WWW Conference, May 2006.