PowerPoint - Cornell University

CS 430: Information Discovery
Lecture 21
Web Search 3
1
Course Administration
Thursday, November 11
No office hours
Tuesday, November 16
No class
Wednesday, November 17
Discussion class requires you to read three short papers.
Wednesday, December 1
Discussion class requires you to search for and read
materials on a specified topic.
2
Effective Information Retrieval
1. Comprehensive metadata with Boolean retrieval (e.g.,
monograph catalog).
Can be excellent for well-understood categories of material,
but requires expensive metadata, which is rarely available.
2. Full text indexing with ranked retrieval (e.g., news
articles).
Excellent for relatively homogeneous material, but requires
available full text.
Neither of these methods is very effective when applied
directly to the Web.
3
Effective Information Retrieval (cont)
3. Full text indexing with contextual information and
ranked retrieval (e.g., Google, Teoma).
Excellent for mixed textual information with rich
structure.
4. Contextual information with non-textual
materials and ranked retrieval (e.g., Google and Yahoo
image retrieval).
Promising, but still experimental.
4
New concepts in Web Searching
• Goal of search is redefined to emphasize precision of the
most highly ranked group of hits.
• Concept of relevance is changed to include importance of
documents as a factor in ranking.
• Browsing is tightly connected to searching.
• Contextual information is used as an integral part of the
search.
5
Browsing
Users give queries of 2 to 4 words
Most users click only on the first few results; few
go beyond the fold on the first page
80% of users use search engines to find sites:
search to find site
browse to find information
Amit Singhal, Google, 2004
6
Browsing and Searching
Searching is followed by browsing.
Browsing the hit list:
helpful summary records (snippets)
removal of duplicates
grouping results from a single site
Browsing the web pages themselves:
direct links from the snippets to the pages
cache with highlights
translation in same format
7
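Two of the hit-list features above, removal of duplicates and grouping results from a single site, can be sketched as follows. This is only an illustration, not how any particular search service does it: the function name group_hits and the idea that each hit arrives with a precomputed content fingerprint are assumptions made for the example.

```python
from urllib.parse import urlsplit


def group_hits(hits):
    """Drop duplicate pages and group the remaining hits by site.

    hits: list of (url, fingerprint) pairs in rank order, where the
    fingerprint is a hypothetical hash of the page content.
    Returns a list of (host, [urls]) in order of each site's best hit.
    """
    seen = set()
    groups = {}  # dicts keep insertion order, so rank order is preserved
    for url, fingerprint in hits:
        if fingerprint in seen:
            continue  # same content already shown under another URL
        seen.add(fingerprint)
        host = urlsplit(url).netloc.lower()
        groups.setdefault(host, []).append(url)
    return list(groups.items())
```

A lower-ranked copy of a page on a mirror site is dropped, and the second hit from the same host is folded under the first.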
Dynamic Snippets
Query: Cornell sports
LII: Law about...Sports
... sports law: an overview. Sports Law encompasses
a multitude areas of law brought together in unique
ways. Issues ... vocation. Amateur Sports. ...
www.law.cornell.edu/topics/sports.html
Query: NCAA Tarkanian
LII: Law about...Sports
... purposes. See NCAA v. Tarkanian, 109 US 454
(1988). State action status may also be a factor in
mandatory drug testing rules. On ...
www.law.cornell.edu/topics/sports.html
8
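The two snippets above come from the same page but show different passages because each is built around the query terms. A minimal sketch of that idea, assuming simple whitespace tokenization and a fixed word window (the function name and both parameters are hypothetical, and real engines are certainly more elaborate):

```python
import re


def dynamic_snippet(text, query, window=6, max_fragments=3):
    """Build a query-biased snippet from short word windows around hits."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    # Positions of words that match a query term, ignoring case and punctuation.
    hits = [i for i, w in enumerate(words)
            if re.sub(r"\W", "", w).lower() in terms]
    fragments, last_end = [], -1
    for i in hits:
        if i <= last_end:
            continue  # this hit is already inside the previous fragment
        start = max(i - window // 2, last_end + 1, 0)
        end = min(i + window // 2, len(words) - 1)
        fragments.append(" ".join(words[start:end + 1]))
        last_end = end
        if len(fragments) == max_fragments:
            break
    if not fragments:
        return ""
    return "... " + " ... ".join(fragments) + " ..."
```

Calling the function twice on the same page text with different queries returns different fragments, which is the behavior the slide illustrates.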
Contextual information
The context in which an item exists may give useful
information for searching.
Information about a document:
• Content (terms, formatting, etc.)
• Metadata (externally created following rules)
• Context (citations and links, reviews, annotations, etc.)
Context has many uses:
• Selecting documents to index
• Retrieval clues (e.g., anchor text)
• Ranking
9
Context: Anchor Text
Linking page
Linked to page
words words words
Cornell University
words words words
HTML source
<a href="http://www.cornell.edu">Cornell University</a>
10
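As a rough sketch of how anchor text can be collected as a retrieval clue, the class below uses Python's standard html.parser module to record the text of each link under the URL it points to. The class name and its anchor_text attribute are made up for this example; an indexer would then add those words to the index entry of the linked-to page.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect anchor text, keyed by the URL of the linked-to page."""

    def __init__(self):
        super().__init__()
        self.anchor_text = defaultdict(list)  # target URL -> list of texts
        self._href = None   # target of the <a> element currently open
        self._parts = []    # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse line breaks and runs of spaces inside the link text.
            text = " ".join("".join(self._parts).split())
            if text:
                self.anchor_text[self._href].append(text)
            self._href = None
```

Feeding the linking page's HTML to an instance with feed() leaves "Cornell University" recorded against http://www.cornell.edu, even though those words need not appear on the linked-to page itself.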
Context: Image Searching
HTML source
<img src="images/Arms.jpg" alt="Photo of William Arms">
Captions and other
adjacent text on the
web page
From the Information Science web site
11
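A small sketch of the image case, again using Python's standard html.parser: it records the alt text of each image so that the words can be indexed against the image file. The class name is hypothetical, and captions and other adjacent text, which the slide also mentions, are not handled here.

```python
from html.parser import HTMLParser


class ImageTextCollector(HTMLParser):
    """Map each image source to the words of its alt text."""

    def __init__(self):
        super().__init__()
        self.images = {}  # image src -> list of lowercase index terms

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:
            alt = attributes.get("alt") or ""
            self.images[src] = alt.lower().split()
```

For the HTML source shown above, the image images/Arms.jpg would be indexed under the terms photo, of, william and arms.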
Reference Pattern Ranking using
Dynamic Document Sets
PageRank calculates document ranks for the entire (fixed) set of
documents. The calculations are made periodically (e.g., monthly)
and the document ranks are the same for all queries.
Concept of dynamic document sets. Reference patterns among
documents that are related to a specific query convey more
information than patterns calculated across entire document
collections.
With dynamic document sets, reference patterns are calculated
for a set of documents that are selected based on each individual
query.
12
Reference Pattern Ranking using
Dynamic Document Sets
Teoma Dynamic Ranking Algorithm (used in Ask Jeeves)
1. Search using conventional term weighting. Rank the hits using
similarity between query and documents.
2. Select the highest ranking hits (e.g., top 5,000 hits).
3. Carry out PageRank or similar algorithm on this set of hits.
This creates a set of document ranks that are specific to this query.
4. Display the results ranked in the order of the reference patterns
calculated.
13
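The four steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Teoma's actual algorithm: the term weighting is a plain count of query-term occurrences, the link analysis is a basic PageRank iteration restricted to the selected hits, and the names dynamic_rank, pagerank, docs and links are invented for the example.

```python
def pagerank(nodes, links, damping=0.85, iterations=50):
    """Basic PageRank over `nodes`; links to pages outside the set are ignored."""
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            out = [q for q in links.get(p, []) if q in rank]
            if out:
                share = damping * rank[p] / len(out)
                for q in out:
                    new[q] += share
            else:
                # A page with no links inside the set spreads its rank evenly.
                for q in nodes:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


def dynamic_rank(query, docs, links, top_k=5000):
    """docs: id -> text; links: id -> list of ids that the page links to."""
    terms = query.lower().split()
    # Step 1: score by query-term occurrences (stand-in for term weighting).
    scores = {d: sum(text.lower().split().count(t) for t in terms)
              for d, text in docs.items()}
    hits = sorted((d for d in scores if scores[d] > 0),
                  key=lambda d: -scores[d])
    # Step 2: keep only the highest ranking hits.
    selected = hits[:top_k]
    if not selected:
        return []
    # Step 3: link analysis on this query-specific set only.
    ranks = pagerank(selected, links)
    # Step 4: order the results by the query-specific document ranks.
    return sorted(selected, key=lambda d: -ranks[d])
```

Because links from pages outside the selected set are ignored, a page's rank here depends only on references from other documents that match the same query, which is the point of using a dynamic document set.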
Scalability
[Chart: the growth of the web, 1994 to 2000; logarithmic vertical axis from 1 to 10,000,000,000]
14
Scalability
Web search services are centralized systems
• Over the past 9 years, Moore's Law has enabled the
services to keep pace with the growth of the web and the
number of users, while adding extra function.
• Will this continue?
• Possible areas for concern are: staff costs,
telecommunications costs, disk access rates.
15
Growth of Web Searching
In November 1997:
• AltaVista was handling 20 million searches/day.
• Google forecast for 2000 was 100s of millions of searches/day.
In 2004, Google reports 250 million web searches/day, and
estimates that the total number over all engines is 500 million
searches/day.
Moore's Law and web searching
In 7 years, Moore's Law predicts computer power will increase by
a factor of at least 2^4 = 16.
It appears that computing power is growing at least as fast as web
searching.
16
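The arithmetic behind this comparison can be checked with a short calculation. The doubling period of 1.75 years is an assumption chosen because it reproduces the slide's factor of 2^4 over 7 years; the function name is hypothetical, and comparing AltaVista in 1997 with Google in 2004 is only a rough like-for-like comparison of single services.

```python
def moores_law_factor(years, doubling_years=1.75):
    """Growth factor in computing power if it doubles every `doubling_years`."""
    return 2 ** (years / doubling_years)


# Figures quoted on the slide, in searches per day.
ALTAVISTA_1997 = 20e6
GOOGLE_2004 = 250e6

computing_growth = moores_law_factor(7)        # 16.0
search_growth = GOOGLE_2004 / ALTAVISTA_1997   # 12.5
```

With these assumptions, computing power grows by a factor of 16 over the 7 years while the search load of a leading service grows by a factor of about 12.5.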
Growth of Google
In 2000: 85 people
50% technical, 14 with Ph.D.s in Computer Science
In 2000: Equipment
2,500 Linux machines
80 terabytes of spinning disks
30 new machines installed daily
Reported by Larry Page, Google, March 2000
At that time, Google was handling 5.5 million searches per day
Increase rate was 20% per month
By fall 2002, Google had grown to over 400 people.
In 2004, Google plans to hire 1,000 new people.
17
Scalability: Performance
Very large numbers of commodity computers
Algorithms and data structures scale linearly
• Storage
– Scale with the size of the Web
– Compression/decompression
• System
– Crawling, indexing, sorting simultaneously
• Searching
– Bounded by disk I/O
18
Software and Hardware Replication
[Diagram: the search service is backed by replicated index servers, document servers, advertisement servers, and spell-checking servers]
19
Scalability: Numbers of Computers
Very rough calculation
In March 2000, 5.5 million searches per day required 2,500
computers.
In fall 2004, computers are about 8 times more powerful.
Estimated number of computers for 250 million searches per day:
= (250/5.5) x 2,500/8
= about 15,000
Some industry estimates suggest that Google may have as many as
100,000 computers.
20
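The very rough calculation above can be written out as a small function. It assumes, as the slide does, that the number of computers scales linearly with the number of searches and inversely with the power of each computer; the function and parameter names are made up for this sketch.

```python
def estimated_computers(searches_per_day,
                        base_searches_per_day=5.5e6,
                        base_computers=2500,
                        speedup=8):
    """Scale the March 2000 figures to a new load on faster machines."""
    load_ratio = searches_per_day / base_searches_per_day
    return load_ratio * base_computers / speedup


estimate_2004 = estimated_computers(250e6)
```

For 250 million searches per day this gives a little over 14,000 computers, which the slide rounds to about 15,000.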
Scalability: Staff
Programming: Have very well trained staff. Isolate complex
code. Most coding is single image.
System maintenance: Organize for minimal staff (e.g.,
automated log analysis, do not fix broken computers).
Customer service: Automate everything possible, but
complaints, large collections, etc. require staff.
21
Evaluation of Web Searching
Test corpus must be dynamic
The web is dynamic: 10%-20% of URLs change every month
Spam methods change continually
Queries are time sensitive
Topics are hot and then not
Need to have a sample of real queries
Languages
At least 90 different languages
Reflected in cultural and technical differences
Amit Singhal, Google, 2004
22
Other Uses of Web Crawling and
Associated Technology
The technology developed for web search services has many
other applications.
Conversely, technology developed for other Internet
applications can be applied in web searching
• Related objects (e.g., Amazon's "Other people bought the
following").
• Recommender and reputation systems (e.g., Epinions'
reputation system).
23
Google API
24
Selective
searching
25
Google
News
26