CS/INFO 430 Information Retrieval
Lecture 17: Web Search 3
Course Administration
Information Retrieval Using PageRank
Simple Method: Rank by Popularity
Consider all hits (i.e., all documents that match the query in
the Boolean sense) as equal.
Display the hits ranked by PageRank.
The disadvantage of this method is that it takes no account
of how closely a document matches the query.
Combining Term Weighting with
Reference Pattern Ranking
Combined Method
1. Find all documents that contain the terms in the query vector.
2. Let sj be the similarity between the query and document j,
calculated using tf.idf or a related method.
3. Let pj be the popularity of document j, calculated using
PageRank or another measure of importance.
4. The combined rank cj = λsj + (1 − λ)pj, where λ is a constant.
5. Display the hits ranked by cj.
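
A minimal sketch of the combined method in Python, assuming the similarity and popularity scores have already been computed and normalized to comparable scales (the names and example values are illustrative):

def combined_rank(hits, similarity, popularity, lam=0.5):
    # c_j = lam * s_j + (1 - lam) * p_j for every hit j,
    # then sort the hits by the combined score, highest first.
    scores = {j: lam * similarity[j] + (1 - lam) * popularity[j] for j in hits}
    return sorted(hits, key=lambda j: scores[j], reverse=True)

# Hypothetical hits with tf.idf similarities (s) and PageRank values (p):
hits = ["d1", "d2", "d3"]
s = {"d1": 0.9, "d2": 0.4, "d3": 0.7}
p = {"d1": 0.1, "d2": 0.8, "d3": 0.3}
print(combined_rank(hits, s, p, lam=0.6))   # ['d1', 'd2', 'd3']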
Questions about PageRank
Most pages have very small PageRanks
• For searches that return large numbers of hits, there are usually a
reasonable number of pages with high PageRank.
• For searches that return smaller numbers of hits, e.g., highly
specific queries, all the pages may have very small PageRanks, so
that it is difficult to rank them in a sensible order.
Example
A search by a customer for information about a product may rank
a large number of mail order businesses that sell the product
above the manufacturer's site that provides a specification for
the product. Small numbers of links may make big changes to rank.
Advanced Graphical Methods:
www.teoma.com
• Carry out a search
• Divide Web sites found by a search into clusters, known as
communities
• Calculate authority within communities
• Calculate hubs within communities, known as experts
Note: Teoma does not publish the precise algorithms it uses
Other Factors in Ranking
The coefficients sj and pj may be varied by adding other evidence (a sketch follows the lists below).
Similarity ranking sj might weight:
• structural mark-up, e.g., headings, bold, etc.
• meta-tags
• anchor text and adjacent text in the linking page
• file names
Popularity ranking pj might weight:
• usage data of page
• previous searches by same user
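
One possible way to fold such evidence into the similarity score sj is to boost a term's tf.idf weight according to the fields in which the term occurs. The field names and boost values below are purely illustrative assumptions, not figures from the lecture:

# Illustrative field boosts; the values are assumptions, not from the lecture.
FIELD_BOOST = {
    "body": 1.0,
    "heading": 2.0,       # structural mark-up
    "meta": 1.5,          # meta-tags
    "anchor_text": 2.5,   # anchor text in linking pages
    "file_name": 2.0,
}

def boosted_term_weight(tf_idf, fields):
    # Scale the tf.idf weight by the strongest field the term appears in.
    return tf_idf * max(FIELD_BOOST[f] for f in fields)

# A query term that appears in the body and in a heading of the page:
print(boosted_term_weight(0.42, ["body", "heading"]))   # 0.84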
Anchor Text and Adjacent Text
[Figure: a link in Document A pointing to Document B; the anchor text and the adjacent text in Document A provide information about Document B.]
Anchor Text and File Names
The source of Document A contains the marked-up text:
<a href="http://www.cis.cornell.edu/">The Faculty of
Computing and Information Science</a>
This string provides the following index terms about
Document B:
Anchor text: faculty, computing, information, science
File name: cis, cornell
Note: A specific stop list is needed for each category
of text.
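
A rough sketch of how those index terms might be extracted; the stop lists and regular expression are illustrative assumptions:

import re

ANCHOR_STOP_WORDS = {"the", "a", "an", "of", "and"}
FILE_NAME_STOP_WORDS = {"http", "https", "www", "com", "edu", "org", "index", "html"}

def terms_from_link(anchor_element):
    # Split one <a href="...">...</a> element into anchor-text terms
    # (describing Document B) and file-name terms taken from the URL.
    match = re.search(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>',
                      anchor_element, re.IGNORECASE | re.DOTALL)
    href, anchor_text = match.group(1), match.group(2)
    anchor_terms = [w for w in re.findall(r"[a-z]+", anchor_text.lower())
                    if w not in ANCHOR_STOP_WORDS]
    file_terms = [w for w in re.findall(r"[a-z]+", href.lower())
                  if w not in FILE_NAME_STOP_WORDS]
    return anchor_terms, file_terms

link = ('<a href="http://www.cis.cornell.edu/">The Faculty of '
        'Computing and Information Science</a>')
print(terms_from_link(link))
# (['faculty', 'computing', 'information', 'science'], ['cis', 'cornell'])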
Indexing Non-Textual Materials
Factors that can be used to index non-textual materials:
• anchor text, including the alt attributes of <img> tags
• text adjacent to an anchor
• file names
• PageRank
This is the concept behind image searching on the Web.
Context: Image Searching
HTML source:
<img src="images/Arms.jpg" alt="Photo of William Arms">
Captions and other adjacent text on the web page
From the Information Science web site
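
A hedged sketch of collecting index terms for an image from its HTML context; the function name and term-splitting rules are illustrative:

import re

def image_index_terms(img_tag, adjacent_text=""):
    # Gather terms from the alt attribute, the file name in src,
    # and captions or other text adjacent to the image.
    terms = []
    for attr in ("alt", "src"):
        value = re.search(attr + r'="([^"]*)"', img_tag, re.IGNORECASE)
        if value:
            terms += re.findall(r"[a-z]+", value.group(1).lower())
    terms += re.findall(r"[a-z]+", adjacent_text.lower())
    return terms

tag = '<img src="images/Arms.jpg" alt="Photo of William Arms">'
print(image_index_terms(tag, "From the Information Science web site"))
# ['photo', 'of', 'william', 'arms', 'images', 'arms', 'jpg',
#  'from', 'the', 'information', 'science', 'web', 'site']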
Evaluation: Web Searching
Test corpus must be dynamic
The Web is dynamic: 10%-20% of URLs change every month
Spam methods change continually
Queries are time sensitive
Topics are hot and then not
Need to have a sample of real queries
Languages
At least 90 different languages
Reflected in cultural and technical differences
Amit Singhal, Google, 2004
Evaluation: Search + Browse
Users give queries of 2 to 4 words
Most users click only on the first few results; few
go beyond the fold on the first page
80% of users use a search engine to find sites:
search to find the site
browse to find the information
Amit Singhal, Google, 2004
Browsing is a major topic in the lectures on Usability
Evaluation:
The Human in the Loop
[Diagram: the user searches the index, which returns hits; the user then browses the returned documents (objects).]
Scalability
Question: How big is the Web and how fast is it growing?
Answer: Nobody knows
Estimates of the Crawled Web:
1994: 100,000 pages
1997: 1,000,000 pages
2000: 1,000,000,000 pages
2005: 8,000,000,000 pages
Rough estimates of the Crawlable Web suggest at least 4x
Rough estimates of the Deep Web suggest at least 100x
Scalability: Software and Hardware
Replication
[Diagram: the search service is replicated across many index servers, document servers, advertisement servers, and spell-checking servers.]
Scalability: Large-scale Clusters of
Commodity Computers
"Component failures are the norm rather than the exception....
The quantity and quality of the components virtually guarantee
that some are not functional at any given time and some will not
recover from their current failures. We have seen problems
caused by application bugs, operating system bugs, human
errors, and the failures of disks, memory, connectors,
networking, and power supplies...."
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The
Google File System." 19th ACM Symposium on Operating
Systems Principles, October 2003.
http://portal.acm.org/citation.cfm?doid=945445.945450
Scalability: Performance
Very large numbers of commodity computers
Algorithms and data structures scale linearly
• Storage
– Scale with the size of the Web
– Compression/decompression
• System
– Crawling, indexing, sorting simultaneously
• Searching
– Bounded by disk I/O
Scalability of Staff: Growth of Google
In 2000: 85 people
50% technical; 14 with Ph.D.s in Computer Science
In 2000: Equipment
2,500 Linux machines
80 terabytes of spinning disks
30 new machines installed daily
Reported by Larry Page, Google, March 2000
At that time, Google was handling 5.5 million searches per day
The rate of increase was 20% per month.
By fall 2002, Google had grown to over 400 people.
By fall 2006, Google had over 9,000 people.
Scalability: Numbers of Computers
Very rough calculation
In March 2000, 5.5 million searches per day, required 2,500
computers
In fall 2004, computers were about 8 times more powerful.
Estimated number of computers for 250 million searches per day:
= (250/5.5) x 2,500/8
= about 15,000
Some industry estimates (based on Google's capital expenditure)
suggest that Google and Yahoo may have had as many as
250,000+ computers in fall 2006.
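
The back-of-the-envelope arithmetic from the slide, written out; the figures come from the slide, the variable names are illustrative:

searches_2000 = 5.5e6      # searches/day in March 2000
computers_2000 = 2500      # machines handling that load
speedup = 8                # machines roughly 8x more powerful by fall 2004
searches_2004 = 250e6      # searches/day in 2004

estimate = (searches_2004 / searches_2000) * computers_2000 / speedup
print(round(estimate))     # about 14,000, i.e. roughly 15,000 machines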
Scalability: Staff
Programming: As the number of programmers grows it
becomes increasingly difficult to maintain the quality of
software.
Have very well trained staff. Isolate complex code. Most
coding is single image.
System maintenance: Organize for minimal staff (e.g.,
automated log analysis, do not fix broken computers).
Customer service: Automate everything possible, but
complaints, large collections, etc. still require staff.
Scalability of Staff:
The Neptune Project
The Neptune Clustering Software:
A programming API and runtime support that allow a
network service to be programmed quickly for execution on a
large-scale cluster handling high-volume user traffic.
The system shields application programmers from the
complexities of replication, service discovery, failure
detection and recovery, load balancing, resource monitoring
and management.
Tao Yang, University of California, Santa Barbara
http://www.cs.ucsb.edu/projects/neptune/
Scalability: the Long Term
Web search services are centralized systems
• Over the past 12 years, Moore's Law has enabled Web
search services to keep pace with the growth of the Web
and the number of users, while adding extra function.
• Will this continue?
• Possible areas for concern are: staff costs,
telecommunications costs, disk and memory access rates,
equipment costs.
Growth of Web Searching
In November 1997:
• AltaVista was handling 20 million searches/day.
• Google forecast for 2000 was 100s of millions of searches/day.
In 2004, Google reported 250 million web searches/day, and
estimated that the total number over all engines was 500 million
searches/day.
Moore's Law and Web searching
Over 7 years, Moore's Law predicts that computer power increases by a
factor of at least 2^4 = 16.
It appears that computing power is growing at least as fast as web
searching.
Other Uses of Web Crawling and
Associated Technology
The technology developed for Web search services has many
other applications.
Conversely, technology developed for other Internet
applications can be applied in Web searching:
• Related objects (e.g., Amazon's "Other people bought the
following").
• Recommender and reputation systems (e.g., Epinions'
reputation system).