CS/INFO 430 Information Retrieval Web Search 4 Lecture 18

Course Administration
2
Search Engine Spam: Objective
Success of commercial Web sites depends on the number of
visitors that find the site while searching for a particular
product.
85% of searchers look at only the first page of results
A new business sector – search engine optimization
M. Henzinger, R. Motwani, and C. Silverstein. Challenges in
web search engines. International Joint Conference on
Artificial Intelligence, 2003.
I. Drost and T. Scheffer. Thwarting the Nigritude
Ultramarine: Learning to identify link spam. 16th European
Conference on Machine Learning, Porto, 2005.
3
Spam: Meta Tags
Meta tags provide the creator of a Web page a place for
cataloguing data that describes the page, but they can also be used for
advertising, misleading, or other mischievous text.
Example: http://www.georgewbush.com/ (October 2000)
<meta name="DC.Subject" content="George W. Bush;
Bush; George Bush; President; republican; 2000 election;
election; presidential election; George; B2K; Bush for
President; Junior; Texas; Governor; taxes; technology;
education; agriculture; health care; environment; society;
social security; medicare; income tax; foreign policy;
defense; government">
4
Search Engine Spam: Techniques
Invisible text:
Add keywords to a page in the hope that search engines
will index them, but arranged so that they are not visible to
a user, e.g., by a special format or by text in the background color.
Cloaking:
Return a different page to Web crawlers than to ordinary
users.
(Can also be used to help Web search, e.g., by providing
a text version of a highly visual page.)
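One way a search service might look for cloaking is to compare the text served to its crawler with the text served to an ordinary browser. The sketch below assumes both versions have already been fetched; the function names and the 0.5 threshold are hypothetical choices for illustration, and a low similarity is only a signal, since the two versions can differ for legitimate reasons.

```python
import re

def term_set(text):
    """Lowercased set of word tokens in a page's visible text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two term sets (1.0 = identical, 0.0 = disjoint)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def looks_cloaked(crawler_text, browser_text, threshold=0.5):
    """Flag a page when the two versions share few terms.

    The threshold is an illustrative assumption, not a published value.
    """
    return jaccard(term_set(crawler_text), term_set(browser_text)) < threshold

crawler_version = "cheap flights cheap hotels cheap car rental best deals"
browser_version = "welcome to our online casino play now"
print(looks_cloaked(crawler_version, browser_version))
print(looks_cloaked(browser_version, browser_version))
```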
5
Search Engine Spam: Anchor Text
Search engines assume that anchor text provides helpful
terms to index the page that is linked to. But anchor text can
be deliberately misleading.
Consider the impact if a million pages each contained the
anchor text:
<a href="http://www.columbia.edu/">Cornell University</a>
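The effect can be sketched with a toy anchor-text index. This is a minimal illustration, assuming an indexer that simply counts the anchor terms pointing at each URL; the class and function names are hypothetical, and the page counts are scaled down from a million to ten.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None

def anchor_index(pages):
    """Map each target URL to a count of the terms used to link to it."""
    index = defaultdict(Counter)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        collector.close()
        for href, text in collector.pairs:
            index[href].update(text.lower().split())
    return index

honest = ['<a href="http://www.columbia.edu/">Columbia University</a>'] * 3
misleading = ['<a href="http://www.columbia.edu/">Cornell University</a>'] * 10
index = anchor_index(honest + misleading)
print(index["http://www.columbia.edu/"].most_common(3))
```

In this toy index the term "cornell" outweighs "columbia" for the Columbia URL, which is the distortion the slide describes.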
6
Search Engine Spam: Anchor Text
Google Bomb: a collective hyperlinking strategy intended to
change the search results of a specific term or phrase.
Examples
The "miserable failure" Google bomb promoted George W.
Bush’s page on whitehouse.gov to the number one rank in a
search of the phrase "miserable failure."
The "Jew" Google bomb demoted an anti-Semitic Web site
from number one rank with a search of "Jew," and promoted
the wikipedia.org definition of "Jew" to number one.
See: Clifford Tatum, 2005,
http://www.firstmonday.org/issues/issue10_10/tatum/
7
Link Spamming: Techniques
Link exchange services: Listings of (often unrelated)
hyperlinks. To be listed, businesses have to provide a back
link that enhances the PageRank of the exchange service.
Guestbooks, discussion boards, and weblogs: Automatic
tools post large numbers of messages to many sites; each
message contains a hyperlink to the target website.
Link farms: Densely connected arrays of pages. Farm pages
propagate their PageRank to the target, e.g., by a funnel-shaped
architecture that points directly or indirectly towards
the target page. To camouflage link farms, tools fill in
inconspicuous content, e.g., by copying news bulletins.
8
Search Engine Spam: Link Farms
[Figure: the regular Web, W, with $n_w$ pages, and a link farm, F, with $n_f$ pages. A link from W to F lets the crawler find F.]
9
Search Engine Spam: Link Farms
Consider the PageRank iteration formula
$w_k = (1-d)\,w_0 + d\,B\,w_{k-1}$
Assuming that all pages are crawled, the effect of the term
$(1-d)\,w_0$
is that the random jumps go to W and F in the ratio $n_w : n_f$.
Since there are few links between W and F, the effect of B is
to redistribute PageRank within W and within F respectively.
Therefore the total PageRank is divided between W and F
approximately in the ratio $n_w : n_f$.
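The ratio claim can be checked with a short derivation. This is a sketch under idealizing assumptions that go slightly beyond the slide: the initial vector gives every page the same weight $1/(n_w + n_f)$, there are no links at all between W and F, and every page has at least one out-link, so that B only moves PageRank around inside W and inside F. The symbol $S_k(W)$, the total PageRank held by W after k iterations, is notation introduced here.

```latex
\begin{align*}
S_k(W) &= \sum_{p \in W} w_k(p), \qquad
\sum_{p \in W} w_0(p) = \frac{n_w}{n_w + n_f} \\
S_k(W) &= (1-d)\,\frac{n_w}{n_w + n_f} + d\,S_{k-1}(W) \\
S(W) &= \frac{n_w}{n_w + n_f}, \qquad
S(F) = \frac{n_f}{n_w + n_f}, \qquad
S(W) : S(F) = n_w : n_f
\end{align*}
```

The last line is the fixed point of the recurrence in the second line. With a few links between W and F the split is no longer exact, only close to this ratio.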
10
Search Engine Spam: Link Farms
The manager of the link farm, F, can organize the
links within the farm so that certain pages within the
farm, h1, h2, ..., hk, are highly ranked.
A manager who wants to give high rank to a page w0
in W, places links to w0 from several of the pages h1,
h2, ..., hk.
As a result, w0 is linked to from several highly ranked
pages and hence becomes highly ranked. (In addition,
w0 could link back to F thus returning rank to the
farm.)
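The boost can be sketched with a small simulation. This is a toy illustration, not a model of any real search engine: the graph shape (a ring for W, ten farm pages funnelling into two hubs H1 and H2), the page names, and the damping factor 0.85 are all assumptions chosen to keep the example short.

```python
def pagerank(links, d=0.85, iterations=100):
    """Power iteration with a uniform jump vector.

    `links` maps each page to the pages it links to. Every page is
    assumed to have at least one out-link.
    """
    pages = list(links)
    n = len(pages)
    w = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1 - d) / n for p in pages}
        for p, targets in links.items():
            for t in targets:
                nxt[t] += d * w[p] / len(targets)
        w = nxt
    return w

def build_web(spam_links):
    """Regular Web W (a ring of 20 pages) plus a farm F with two hubs."""
    n_w, n_f = 20, 10
    links = {f"W{i}": [f"W{(i + 1) % n_w}"] for i in range(n_w)}
    for i in range(n_f):
        links[f"F{i}"] = ["H1", "H2"]  # farm pages funnel rank to the hubs
    hub_targets = ["F0"] + (["W0"] if spam_links else [])
    links["H1"] = list(hub_targets)
    links["H2"] = list(hub_targets)
    return links

before = pagerank(build_web(spam_links=False))
after = pagerank(build_web(spam_links=True))
print("W0 before:", round(before["W0"], 4))
print("W0 after: ", round(after["W0"], 4))
```

In this toy graph, adding the two hub links raises the rank of W0 above every other page in W, even though nothing inside W has changed.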
11
Link Spamming: Defenses
Manual identification of spam pages and farms to create a
blacklist.
Automatic classification of pages using machine learning
techniques.
BadRank algorithm. The "bad rank" is initialized to a high
value for blacklisted pages. The algorithm propagates bad rank to all
referring pages (with a damping factor), thus penalizing
pages that refer to spam.
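The propagation step can be sketched as PageRank run backwards along the links. The slide gives no formula, so the update rule in the docstring below is an assumption made for illustration, as are the function name, the damping factor, and the toy graph.

```python
def badrank(links, blacklist, d=0.85, iterations=20):
    """Propagate bad rank backwards along links.

    Assumed update (not taken from the slide):
        bad[p] = (1 - d) * E[p] + d * sum(bad[t] / indegree[t] for t in links[p])
    where E[p] is 1.0 for blacklisted pages and 0.0 otherwise.
    """
    indegree = {p: 0 for p in links}
    for targets in links.values():
        for t in targets:
            indegree[t] += 1
    seed = {p: (1.0 if p in blacklist else 0.0) for p in links}
    bad = dict(seed)
    for _ in range(iterations):
        # The comprehension reads the previous iteration's values of `bad`.
        bad = {
            p: (1 - d) * seed[p] + d * sum(bad[t] / indegree[t] for t in links[p])
            for p in links
        }
    return bad

links = {
    "spam": [],
    "referrer": ["spam"],
    "second_hop": ["referrer"],
    "clean_a": ["clean_b"],
    "clean_b": ["clean_a"],
}
bad = badrank(links, blacklist={"spam"})
for page in sorted(bad, key=bad.get, reverse=True):
    print(page, round(bad[page], 4))
```

Pages that link to the blacklisted page, directly or through another page, pick up a damped share of its bad rank, while pages with no path to spam stay at zero.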
12
Search Engine Friendly Pages
Good ways to get your page indexed and ranked highly
• Use straightforward URLs, with simple structure, that do not
change with time.
• Submit your site to be crawled.
• Provide a site map of the pages that you wish to be crawled.
• Have the words that you would expect to see in queries:
- in the content of your pages.
- in the <title> tag and in the <meta name="keywords"> and <meta name="description"> tags.
• Attempt to have links to your page from appropriate authorities.
• Avoid suspicious behavior.
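A site map can be generated from a list of URLs. The sketch below assumes the XML format of the Sitemaps protocol, which is one common way to provide a site map and is not named on the slide; the function name and the example.com URLs are placeholders.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a Sitemaps-protocol XML document listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = address
    return ET.tostring(urlset, encoding="unicode")

sitemap_xml = build_sitemap([
    "http://www.example.com/",
    "http://www.example.com/products.html",
])
print(sitemap_xml)
```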
13
Legal Issues in Web Searching
Copyright
In US law, the creator of a Web page (or the employer) owns the
copyright, with a few exceptions. Copyright gives the owner
exclusive right to: reproduce, distribute, perform, display, or
license others to reproduce, distribute, perform, or display.
Search engines operate under an untested legal concept of an
implied license. The concept is to assume that somebody who
puts a Web page online expects users to download it, read it,
index it, etc., unless the copyright owner explicitly states
otherwise.
Historically, Web companies have been cautious, but recently
Google has been pushing the legal limits.
14
Economic Models for Content and
Services on the Web
Mounting information on the Web or supplying services
costs money. Who pays?
Open access
• Funded externally from other sources (standard model).
• Advertising (e.g., Web search).
Restricted access
• Subscription (e.g., journal publishers).
• Pay by use (rare).
Note that these same four models are used for television.
15
Information about Individuals
Advertising is most effective if it is tailored to the individual.
Portals, such as Yahoo or Google, have many ways of gaining
information about users:
• identity tracked by cookie or login
• search terms used, pages retrieved, advertisements clicked
• use of other services, e.g., travel, shopping, maps
Data mining such information can provide valuable services, but
raises serious concerns about privacy.
17
How many of these services collect information about the user?
18
Adding Audience Information to Ranking
Conventional information retrieval:
A given query returns the same set of hits, ranked in the
same sequence, irrespective of who submitted the query.
If the search service has information about the user:
The result set and/or the ranking can be varied to match the
user's profile.
Example: In an educational digital library, the order of search
results can be varied for:
instructor v. student
grade level of course
19
Adding Audience Information to Ranking
Metadata based methods:
• Label documents with controlled vocabulary to define
intended audience.
• Provide users with means to specify their needs, through a
profile (preferences), or by a query parameter
Automatic methods:
• Capture persistent information about user behavior by data
mining
• Adjust tf.idf rankings using terms derived from those
previously used by the user
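The last automatic method can be sketched as query expansion. This is a minimal illustration, assuming that terms from the user's earlier queries are added to the current query at a reduced weight; the function names, the weight 0.5, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_scores(query_weights, docs):
    """Score each document as the sum over query terms of weight * tf * idf."""
    n = len(docs)
    tokenized = {name: Counter(text.lower().split()) for name, text in docs.items()}
    scores = {}
    for name, tf in tokenized.items():
        score = 0.0
        for term, weight in query_weights.items():
            df = sum(1 for counts in tokenized.values() if term in counts)
            if df:
                score += weight * tf[term] * math.log(n / df)
        scores[name] = score
    return scores

def personalized_query(query, history, profile_weight=0.5):
    """Query terms get weight 1.0; earlier query terms get a smaller weight."""
    weights = {t: profile_weight for t in history}
    weights.update({t: 1.0 for t in query})
    return weights

docs = {
    "car": "jaguar car dealer luxury car prices",
    "cat": "jaguar big cat habitat rainforest",
    "os": "jaguar operating system release notes",
    "news": "local news weather sports",
}
plain = tfidf_scores(personalized_query(["jaguar"], []), docs)
tuned = tfidf_scores(personalized_query(["jaguar"], ["rainforest", "habitat"]), docs)
print(sorted(plain, key=plain.get, reverse=True))
print(sorted(tuned, key=tuned.get, reverse=True))
```

Without history, the three "jaguar" documents tie. With history terms about wildlife, the "cat" document moves to the top for this user only.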
20