Information Discovery
Lecture 21
Web Search 3
Effective Information Retrieval
1. Comprehensive metadata with Boolean retrieval
(e.g., monograph catalog).
Can be excellent for well-understood categories of
material, but requires expensive metadata, which is
rarely available.
2. Full text indexing with ranked retrieval (e.g., news
articles).
Excellent for relatively homogeneous material, but
requires available full text.
Neither of these methods is very effective when
applied directly to the Web.
Effective Information Retrieval (cont)
3. Full text indexing with contextual
information and ranked retrieval (e.g., Google,
Teoma).
Excellent for mixed textual information with rich
structure.
4. Contextual information with non-textual
materials and ranked retrieval (e.g., Google and
Yahoo image retrieval).
Promising, but still experimental.
New concepts in Web Searching
• Goal of search is redefined to emphasize
precision of the most highly ranked group of
hits.
• Concept of relevance is changed to include
importance of documents as a factor in
ranking.
• Browsing is tightly connected to
searching.
• Contextual information is used as an
integral part of the search.
Browsing
Users give queries of 2 to 4 words
Most users click only on the first few results;
few go beyond the fold on the first page
80% of users use a search engine to find
sites:
search to find the site
browse to find information
Amit Singhal, Google, 2004
Browsing and Searching
Searching is followed by browsing.
Browsing the hit list:
helpful summary records (snippets)
removal of duplicates
grouping results from a single site
Browsing the web pages themselves:
direct links from the snippets to the pages
cache with highlights
translation in same format
Dynamic Snippets
Query: Cornell sports
LII: Law about...Sports
... sports law: an overview. Sports Law
encompasses a multitude areas of law brought
together in unique ways. Issues ... vocation.
Amateur Sports. ...
www.law.cornell.edu/topics/sports.html
Query: NCAA Tarkanian
LII: Law about...Sports
... purposes. See NCAA v. Tarkanian, 109 US
454 (1988). State action status may also be a
factor in mandatory drug testing rules. On ...
www.law.cornell.edu/topics/sports.html
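The two examples above show the same page summarized differently for different queries. A minimal sketch of that idea follows, assuming a simple "window of words around the first matching term" rule. The function name, the window size, and the abbreviated page text are illustrative assumptions, not a description of how any search engine builds its snippets.

```python
import re

def dynamic_snippet(text, query, window=6):
    """Return an excerpt of `text` centred on the first word matching a query term."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    # Positions of words that match a query term once punctuation is stripped.
    hits = [i for i, w in enumerate(words)
            if re.sub(r"\W", "", w).lower() in terms]
    if not hits:
        # No query term found: fall back to the start of the document.
        return " ".join(words[:2 * window]) + " ..."
    start = max(0, hits[0] - window)
    end = min(len(words), hits[0] + window + 1)
    return "... " + " ".join(words[start:end]) + " ..."

# Abbreviated stand-in for the page text, pieced together from the snippets above.
page = ("Sports law: an overview. Sports Law encompasses a multitude areas of law "
        "brought together in unique ways. See NCAA v. Tarkanian, 109 US 454 (1988). "
        "State action status may also be a factor in mandatory drug testing rules.")

print(dynamic_snippet(page, "Cornell sports"))
print(dynamic_snippet(page, "NCAA Tarkanian"))
```

The same URL yields a different excerpt for each query because the excerpt is chosen at query time rather than stored with the document.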
Contextual information
The context in which an item exists may give useful
information for searching.
Information about a document:
• Content (terms, formatting, etc.)
• Metadata (externally created following rules)
• Context (citations and links, reviews, annotations,
etc.)
Context has many uses:
• Selecting documents to index
• Retrieval clues (e.g., anchor text)
• Ranking
Context: Anchor Text
Linking page: words words words Cornell University words words words
HTML source: <a href="http://www.cornell.edu">Cornell University</a>
Linked to page: [image not available]
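As a rough illustration of how anchor text can serve as a retrieval clue, the sketch below collects the text inside each link and files it under the linked-to URL, so that the words "Cornell University" become indexable terms for http://www.cornell.edu even if they never appear on that page. It uses Python's standard `html.parser`; the class name and the `anchor_index` structure are assumptions made for this example.

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from a linking page."""

    def __init__(self):
        super().__init__()
        self.links = []     # list of (href, anchor text)
        self._href = None   # href of the <a> element currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # normalize whitespace
            self.links.append((self._href, text))
            self._href = None

html = 'words words <a href="http://www.cornell.edu">Cornell University</a> words words'
parser = AnchorTextParser()
parser.feed(html)

# Index the anchor text under the target URL, not under the linking page.
anchor_index = {}
for href, text in parser.links:
    anchor_index.setdefault(href, []).append(text)
print(anchor_index)
```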
Context: Image Searching
HTML source: <img src="images/Arms.jpg" alt="Photo of William Arms">
[image not available]
Captions and other adjacent text on the web page
From the Information Science web site
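A small sketch of the same idea for images: since the image itself carries no text, the `alt` attribute can supply searchable terms. The class name and the term index below are hypothetical, and a fuller version would also collect captions and adjacent text, which this sketch leaves out.

```python
from html.parser import HTMLParser

class ImageTextParser(HTMLParser):
    """Collect (src, alt text) pairs for every <img> element."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            self.images.append((a.get("src"), a.get("alt") or ""))

parser = ImageTextParser()
parser.feed('<img src="images/Arms.jpg" alt="Photo of William Arms">')

# Map each lower-cased term of the alt text to the images it describes.
term_index = {}
for src, alt in parser.images:
    for term in alt.lower().split():
        term_index.setdefault(term, set()).add(src)
print(sorted(term_index))
```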
Reference Pattern Ranking using Dynamic Document
Sets
PageRank calculates document ranks for the entire (fixed)
set of documents. The calculations are made periodically
(e.g., monthly) and the document ranks are the same for all
queries.
Concept of dynamic document sets. Reference patterns
among documents that are related to a specific query
convey more information than patterns calculated across
entire document collections.
With dynamic document sets, reference patterns are
calculated for a set of documents that are selected based
on each individual query.
Reference Pattern Ranking using Dynamic Document
Sets
Teoma Dynamic Ranking Algorithm (used in Ask
Jeeves)
1. Search using conventional term weighting. Rank the hits
using similarity between query and documents.
2. Select the highest ranking hits (e.g., top 5,000 hits).
3. Carry out PageRank or similar algorithm on this set of
hits. This creates a set of document ranks that are specific
to this query.
4. Display the results ranked in the order of the reference
patterns calculated.
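The four steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Teoma's implementation: `similarity` is a crude stand-in for conventional term weighting, `pagerank` is a basic power iteration with a damping factor of 0.85, and the documents and links are invented for the example.

```python
def similarity(query, text):
    """Crude stand-in for term weighting: fraction of query terms found in the text."""
    q = set(query.lower().split())
    if not q:
        return 0.0
    return len(q & set(text.lower().split())) / len(q)

def pagerank(nodes, links, damping=0.85, iterations=50):
    """Power-iteration PageRank restricted to `nodes`; links leaving the set are ignored."""
    nodes = list(nodes)
    n = len(nodes)
    rank = {d: 1.0 / n for d in nodes}
    out = {d: [t for t in links.get(d, []) if t in rank and t != d] for d in nodes}
    for _ in range(iterations):
        new = {d: (1 - damping) / n for d in nodes}
        for d in nodes:
            if out[d]:
                share = damping * rank[d] / len(out[d])
                for t in out[d]:
                    new[t] += share
            else:
                # Dangling document: spread its rank evenly over the set.
                for t in nodes:
                    new[t] += damping * rank[d] / n
        rank = new
    return rank

def dynamic_rank(query, docs, links, top_k=5000):
    # Steps 1 and 2: rank by query similarity and keep the highest ranking hits.
    scores = {d: similarity(query, text) for d, text in docs.items()}
    hits = [d for d in sorted(docs, key=scores.get, reverse=True) if scores[d] > 0][:top_k]
    if not hits:
        return []
    # Step 3: compute reference-pattern ranks on this query-specific set only.
    rank = pagerank(hits, links)
    # Step 4: display in order of the reference-pattern rank.
    return sorted(hits, key=rank.get, reverse=True)

docs = {"a": "cornell sports news", "b": "cornell sports law",
        "c": "cornell sports scores", "d": "cooking recipes"}
links = {"a": ["b"], "c": ["b"], "d": ["a"]}
print(dynamic_rank("cornell sports", docs, links))
```

Document "d" links to "a", but that link does not count here because "d" is not among the hits for this query. This is the sense in which the document ranks are specific to the query.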
Scalability
[Figure: The growth of the web, 1994 to 2000, plotted on a logarithmic scale from 1 to 10,000,000,000]
Scalability
Web search services are centralized systems
• Over the past 9 years, Moore's Law has enabled the
services to keep pace with the growth of the web
and the number of users, while adding extra
function.
• Will this continue?
• Possible areas for concern are: staff costs,
telecommunications costs, disk access rates.
Growth of Web Searching
In November 1997:
• AltaVista was handling 20 million searches/day.
• Google forecast for 2000 was 100s of millions of
searches/day.
In 2004, Google reports 250 million web searches/day, and
estimates that the total number over all engines is 500 million
searches/day.
Moore's Law and web searching
In 7 years, Moore's Law predicts computer power will increase
by a factor of at least 2^4 = 16.
It appears that computing power is growing at least as fast as
web searching.
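The arithmetic behind this comparison can be checked with a short calculation. The doubling periods below are assumptions (Moore's Law is commonly quoted with periods between 18 and 24 months); a factor of 16 in 7 years corresponds to 4 doublings, or one every 21 months. The search figures are the ones quoted above, and the comparison is rough because the 1997 number covers AltaVista only.

```python
def moore_factor(years, doubling_months):
    """Growth factor in computing power if it doubles every `doubling_months` months."""
    return 2 ** (years * 12 / doubling_months)

# Assumed doubling periods, for illustration only.
for months in (18, 21, 24):
    print(months, "months:", round(moore_factor(7, months), 1))

# Growth in searches per day, using the figures quoted above.
altavista_1997 = 20e6
google_2004 = 250e6
print("search growth:", google_2004 / altavista_1997)
```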
Growth of Google
In 2000: 85 people
50% technical, 14 Ph.D. in Computer Science
In 2000: Equipment
2,500 Linux machines
80 terabytes of spinning disks
30 new machines installed daily
Reported by Larry Page, Google, March 2000
At that time, Google was handling 5.5 million searches per
day
Increase rate was 20% per month
By fall 2002, Google had grown to over 400 people.
Scalability: Performance
Very large numbers of commodity computers
Algorithms and data structures scale linearly
• Storage
  - Scale with the size of the Web
  - Compression/decompression
• System
  - Crawling, indexing, sorting simultaneously
• Searching
  - Bounded by disk I/O
Software and Hardware Replication
[Figure: a search service backed by replicated index servers, document servers, advertisement servers, and spell checking servers]
Scalability: Numbers of Computers
Very rough calculation
In March 2000, 5.5 million searches per day, required 2,500
computers
In fall 2004, computers are about 8 times more powerful.
Estimated number of computers for 250 million searches per
day:
= (250/5.5) x 2,500/8
= about 15,000
Some industry estimates suggest that Google may have as
many as 100,000 computers.
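The rough calculation above can be reproduced directly. The sketch assumes, as the slide does, that the number of computers needed scales linearly with search volume and inversely with per-machine power; the function and parameter names are chosen for this example.

```python
def estimate_computers(searches_now, searches_then, computers_then, speedup):
    """Scale the machine count linearly with load and inversely with machine power."""
    return (searches_now / searches_then) * computers_then / speedup

estimate = estimate_computers(
    searches_now=250e6,    # searches per day, 2004
    searches_then=5.5e6,   # searches per day, March 2000
    computers_then=2500,   # machines in March 2000
    speedup=8,             # assumed gain in per-machine power by fall 2004
)
print(round(estimate))
```

The result is a little over 14,000, which the slide rounds to about 15,000. It is well below the industry estimates of up to 100,000 machines, which fits the description of this as a very rough calculation.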
Scalability: Staff
Programming: Have very well trained staff.
Isolate complex code. Most coding is single
image.
System maintenance: Organize for minimal staff
(e.g., automated log analysis, do not fix broken
computers).
Customer service: Automate everything possible,
but complaints, large collections, etc. require
staff.
Evaluation of Web Searching
Test corpus must be dynamic
The web is dynamic (10%-20% of URLs change every
month)
Spam methods change continually
Queries are time sensitive
Topics are hot and then not
Need to have a sample of real queries
Languages
At least 90 different languages
Reflected in cultural and technical differences
Amit Singhal, Google, 2004
Other Uses of Web Crawling and Associated
Technology
The technology developed for web search services
has many other applications.
Conversely, technology developed for other
Internet applications can be applied in web
searching
• Related objects (e.g., Amazon's "Other people
bought the following").
• Recommender and reputation systems (e.g.,
Epinions' reputation system).
Google API
Selective searching
Google News