CS/INFO 430
Information Retrieval
Lecture 21
Web Search 3
1
Course Administration
Wednesday, November 16
No discussion class
Thursday, November 17
No lecture
No office hours
2
Scalability
Question: How big is the Web and how fast is it growing?
Answer: Nobody knows
Estimates of the Crawled Web:
1994: 100,000 pages
1997: 1,000,000 pages
2000: 1,000,000,000 pages
2005: 8,000,000,000 pages
Rough estimates suggest the Crawlable Web is at least 4x larger,
and the Deep Web at least 100x larger.
3
Scalability: Software and Hardware
Replication
[Figure: the search service is built from large replicated clusters: many index servers and document servers, plus advertisement servers and spell-checking servers]
4
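As a rough illustration of this architecture (the data, names, and scores below are invented, not Google's actual design), a query can be fanned out to every replicated index shard in parallel, the partial result lists merged, and the document servers asked for the text of the top hits:

# A minimal sketch, with made-up data and names, of a query fanning out to
# replicated index servers and then to document servers for the result text.
from concurrent.futures import ThreadPoolExecutor

INDEX_SHARDS = [                       # each shard indexes part of the collection
    {"ithaca": [(1, 0.9), (2, 0.4)]},
    {"ithaca": [(7, 0.7)], "weather": [(7, 0.5)]},
]
DOC_STORE = {1: "Ithaca is gorges ...", 2: "Visiting Ithaca ...", 7: "Ithaca weather ..."}

def search_shard(shard, term):
    # One index server: return its (doc_id, score) postings for the term.
    return shard.get(term, [])

def search(term, top_k=3):
    with ThreadPoolExecutor() as pool:  # query all index shards in parallel
        partials = list(pool.map(search_shard, INDEX_SHARDS,
                                 [term] * len(INDEX_SHARDS)))
    hits = sorted((h for p in partials for h in p), key=lambda h: -h[1])[:top_k]
    # document servers supply the text used to build snippets for the top hits
    return [(doc_id, score, DOC_STORE[doc_id]) for doc_id, score in hits]

print(search("ithaca"))

Replication matters because each shard and document partition would in practice be served by several identical machines, so a failed machine only reduces capacity rather than removing results.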
Scalability: Large-scale Clusters of Commodity Computers
"Component failures are the norm rather than the exception....
The quantity and quality of the components virtually guarantee
that some are not functional at any given time and some will not
recover from their current failures. We have seen problems
caused by application bugs, operating system bugs, human
errors, and the failures of disks, memory, connectors,
networking, and power supplies...."
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The
Google File System." 19th ACM Symposium on Operating
Systems Principles, October 2003.
http://www.cs.rochester.edu/sosp2003/papers/p125ghemawat.pdf
5
Scalability: Performance
Very large numbers of commodity computers
Algorithms and data structures scale linearly
• Storage
– Scale with the size of the Web
– Compression/decompression
• System
– Crawling, indexing, sorting simultaneously
• Searching
– Bounded by disk I/O
6
Scalability: Numbers of Computers
Very rough calculation
In March 2000, 5.5 million searches per day required 2,500 computers.
In fall 2004, computers were about 8 times more powerful.
Estimated number of computers for 250 million searches per day:
= (250/5.5) x 2,500/8
= about 15,000
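The same back-of-the-envelope calculation as a short script (all inputs are the figures quoted on this slide):

# Rough estimate of machines needed in fall 2004 for 250 million searches/day,
# scaling the March 2000 figures and assuming each machine is ~8x faster.
searches_2000 = 5.5e6      # searches per day, March 2000
computers_2000 = 2_500     # computers used at that time
speedup = 8                # assumed per-machine speedup by fall 2004
searches_2004 = 250e6      # searches per day to be supported

computers_2004 = (searches_2004 / searches_2000) * computers_2000 / speedup
print(round(computers_2004))   # about 14,200, i.e. roughly 15,000 machines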
Some industry estimates (based on Google's capital expenditure)
suggest that Google or Yahoo may have had as many as 80,000
computers in spring 2005.
7
Scalability of Staff: Growth of Google
In 2000: 85 people
50% technical, 14 with Ph.D.s in Computer Science
In 2000: Equipment
2,500 Linux machines
80 terabytes of spinning disks
30 new machines installed daily
Reported by Larry Page, Google, March 2000
At that time, Google was handling 5.5 million searches per day
Increase rate was 20% per month
By fall 2002, Google had grown to over 400 people.
In 2004, Google hired 1,000 new people.
8
Scalability: Staff
Programming: Have very well trained staff. Isolate complex
code. Most coding is single image.
System maintenance: Organize for minimal staff (e.g.,
automated log analysis, do not fix broken computers).
Customer service: Automate everything possible, but
complaints, large collections, etc. still require staff.
9
Scalability of Staff: The Neptune Project
The Neptune Clustering Software:
Programming API and runtime support that allow a network service to be
programmed quickly for execution on a large-scale cluster handling
high-volume user traffic.
The system shields application programmers from the
complexities of replication, service discovery, failure
detection and recovery, load balancing, resource monitoring
and management.
Tao Yang, University of California, Santa Barbara
http://www.cs.ucsb.edu/projects/neptune/
10
Scalability: The Long Term
Web search services are centralized systems
• Over the past 9 years, Moore's Law has enabled the
services to keep pace with the growth of the web and the
number of users, while adding extra function.
• Will this continue?
• Possible areas of concern: staff costs, telecommunications costs,
disk and memory access rates, and equipment costs.
11
Growth of Web Searching
In November 1997:
• AltaVista was handling 20 million searches/day.
• Google forecast for 2000 was 100s of millions of searches/day.
In 2004, Google reported 250 million web searches/day, and
estimated that the total number over all engines was 500 million
searches/day.
Moore's Law and Web searching
In 7 years, Moore's Law predicts that computer power increases by a
factor of at least 2^4 = 16.
It appears that computing power is growing at least as fast as web
searching.
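A quick check of that factor, assuming the usual 18-month doubling period:

# With an assumed doubling period of 18 months, 7 years contain at least
# 4 complete doublings, giving the factor of 16 quoted above.
years = 7
doubling_period = 1.5                      # years per doubling (assumption)
doublings = int(years / doubling_period)   # 4 complete doublings
print(2 ** doublings)                      # 16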
12
Search Engine Spam: Objective
Success of commercial Web sites depends on the number of
visitors that find the site while searching for a particular
product.
85% of searchers look at only the first page of results
A new business sector – search engine optimization
M. Henzinger, R. Motwani, and C. Silverstein, "Challenges in Web Search
Engines." International Joint Conference on Artificial Intelligence, 2003.
I. Drost and T. Scheffer, "Thwarting the Nigritude Ultramarine: Learning to
Identify Link Spam." 16th European Conference on Machine Learning, Porto, 2005.
13
Search Engine Spam: Techniques
Text-based:
Add keywords to a page in the hope that search engines will index it,
e.g., in meta-tags, in specially formatted (often hidden) text, etc.
Cloaking:
Return a different page to Web crawlers than to ordinary downloads;
see the sketch after this list. (Can also be used to help Web search,
e.g., by providing a text version of a highly visual page.)
Link-based: (see next slide)
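A minimal sketch of cloaking, assuming the server simply inspects the User-Agent header (the crawler signatures and page contents are illustrative only):

# Illustrative only: return keyword-stuffed text to known crawler User-Agents
# and the normal page to everyone else.
CRAWLER_SIGNATURES = ("Googlebot", "Slurp", "msnbot")   # assumed signatures

def serve_page(user_agent: str) -> str:
    if any(sig in user_agent for sig in CRAWLER_SIGNATURES):
        return "<html>... text stuffed with target keywords ...</html>"
    return "<html>... the page ordinary visitors actually see ...</html>"

print(serve_page("Mozilla/5.0 (compatible; Googlebot/2.1)"))

The same mechanism, used benignly, could return a text equivalent of a highly visual page to the crawler.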
14
Link Spamming: Techniques
Link farms: Densely connected arrays of pages. Farm pages propagate their
PageRank to the target, e.g., by a funnel-shaped architecture that points
directly or indirectly towards the target page. To camouflage link farms,
tools fill in inconspicuous content, e.g., by copying news bulletins.
Link exchange services: Listings of (often unrelated)
hyperlinks. To be listed, businesses have to provide a back
link that enhances the PageRank of the exchange service.
Guestbooks, discussion boards, and weblogs: Automatic
tools post large numbers of messages to many sites; each
message contains a hyperlink to the target website.
15
Link Spamming: Defenses
Manual identification of spam pages and farms to create a
blacklist.
Automatic classification of pages using machine learning
techniques.
BadRank algorithm. The "bad rank" is initialized to a high value for
blacklisted pages. It propagates bad rank to all referring pages (with a
damping factor), thus penalizing pages that link to spam (see the sketch
below).
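The slide describes the idea only; the following is a minimal sketch of one plausible BadRank-style formulation (published variants differ in details such as normalization), in which bad rank flows backwards along links from blacklisted pages:

# Sketch of a BadRank-style iteration: pages that link to spam inherit a
# damped share of the spam pages' bad rank.
def badrank(links, blacklist, damping=0.85, iterations=20):
    # links: dict mapping each page to the list of pages it links to.
    indegree = {p: 0 for p in links}
    for outs in links.values():
        for t in outs:
            indegree[t] = indegree.get(t, 0) + 1
    bad = {p: (1.0 if p in blacklist else 0.0) for p in links}
    for _ in range(iterations):
        new = {}
        for page, outlinks in links.items():
            seed = 1.0 if page in blacklist else 0.0
            inherited = sum(bad.get(t, 0.0) / max(indegree.get(t, 1), 1)
                            for t in outlinks)
            new[page] = seed + damping * inherited
        bad = new
    return bad

web = {"a": ["spam"], "b": ["a"], "good": ["news"], "news": [], "spam": []}
print(badrank(web, blacklist={"spam"}))
# "a" is penalized for linking to spam; "b" receives a smaller, second-hand penalty.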
16
Search Engine Friendly Pages
Good ways to get your page indexed and ranked highly
• Use straightforward URLs, with simple structure, which do not
change with time.
• Submit your site to be crawled.
• Provide a site map of the pages that you wish to be crawled.
• Have the words that you would expect to see in queries:
- in the content of your pages.
- in the <title> tag and in <meta name="keywords"> and <meta name="description"> tags.
• Attempt to have links to your page from appropriate authorities.
• Avoid suspicious behavior.
17
Adding audience information to ranking
Conventional information retrieval:
A given query returns the same set of hits, ranked in the
same sequence, irrespective of who submitted the query.
If the search service has information about the user:
The result set and/or the ranking can be varied to match the user's profile.
Example: In an educational digital library, the order of search
results can be varied for:
instructor v. student
grade level of course
18
Adding audience information to ranking
Metadata-based methods:
• Label documents with a controlled vocabulary to define the intended audience.
• Provide users with a means to specify their needs, either through
a profile (preferences) or by a query parameter.
Automatic methods:
• Capture persistent information about user behavior.
• Adjust tf.idf rankings using terms derived from user behavior (a sketch follows below).
Data-mining to capture user information raises privacy concerns
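As an illustration of the last bullet, a minimal sketch (hypothetical document names and boost factor) of re-ranking tf.idf scores with terms mined from a user's past behavior:

def personalized_ranking(base_scores, doc_terms, profile_terms, boost=0.1):
    # Give each document a small boost for every term it shares with the
    # user's behavior-derived profile; boost is a tuning parameter.
    adjusted = {}
    for doc_id, score in base_scores.items():
        overlap = len(doc_terms[doc_id] & profile_terms)
        adjusted[doc_id] = score * (1.0 + boost * overlap)
    return sorted(adjusted, key=adjusted.get, reverse=True)

# Example: an instructor profile pulls the lesson plan above the quiz.
base = {"quiz": 2.0, "lesson_plan": 1.9}
terms = {"quiz": {"algebra", "quiz"}, "lesson_plan": {"algebra", "syllabus", "rubric"}}
print(personalized_ranking(base, terms, profile_terms={"syllabus", "rubric"}))
# ['lesson_plan', 'quiz']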
19
How many of these services collect information about the user?
20
Other Uses of Web Crawling and Associated Technology
The technology developed for Web search services has many
other applications.
Conversely, technology developed for other Internet applications can be
applied to Web searching:
• Related objects (e.g., Amazon's "Other people bought the
following").
• Recommender and reputation systems (e.g., ePinion's
reputation system).
21
Context: Image Searching
HTML source:
<img src="images/Arms.jpg" alt="Photo of William Arms">
[Image: photo of William Arms]
Captions and other adjacent text on the web page
From the Information Science web site
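A minimal sketch, using Python's standard html.parser (the page snippet is invented), of collecting the alt text and adjacent text that an image-search indexer might associate with a picture:

from html.parser import HTMLParser

class ImageTextExtractor(HTMLParser):
    # Collect alt text of <img> tags plus the surrounding page text.
    def __init__(self):
        super().__init__()
        self.alt_texts, self.page_text = [], []
    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alt_texts.append(alt)
    def handle_data(self, data):
        if data.strip():
            self.page_text.append(data.strip())

page = '<p>Faculty: William Arms</p><img src="images/Arms.jpg" alt="Photo of William Arms">'
parser = ImageTextExtractor()
parser.feed(page)
print(parser.alt_texts)   # ['Photo of William Arms']
print(parser.page_text)   # ['Faculty: William Arms']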
22
Browsing
Users give queries of 2 to 4 words
Most users click only on the first few results; few
go beyond the fold on the first page
80% of users use a search engine to find sites:
search to find the site,
browse to find the information.
Amit Singhal, Google, 2004
Browsing is a major topic in the lectures on Usability
23
Evaluating Web Searching
Test corpus must be dynamic
The Web is dynamic: 10%-20% of URLs change every month
Spam methods change continually
Queries are time-sensitive
Topics are hot and then not
Need to have a sample of real queries
Languages
At least 90 different languages
Reflected in cultural and technical differences
Amit Singhal, Google, 2004
24