Lecture 5: Search Engines

advertisement
Lecture 5:
Search Engines
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Outline
• Search engines: key tools for ecommerce
– Buyers and sellers must find each other
•
•
•
•
How do they work?
How much do they index?
How are hits ordered?
Can the order be changed?
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engines
• Tools for finding information on the Web
– Problem: “hidden” databases, e.g. New York Times
• Directory
– A hand-constructed hierarchy of topics (e.g. Yahoo)
• Search engine
– A machine-constructed index (usually by keyword)
• So many search engines, we now need search
engines to find them. Searchenginecollosus.com
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Indexing
• Arrangement of data (data structure) to permit fast
searching
• Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
• Sorting helps. Why?
– Permits binary search. About log2n probes into list
• log2(1 billion) ~ 30
– Permits interpolation search. About log2(log2n) probes
• log2 log2(1 billion) ~ 5
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Inverted Files
POS
1
10
20
30
36
FILE
A file is a list of words by position
– First entry is the word in position 1 (first word)
– Entry 4562 is the word in position 4562 (4562nd word)
– Last entry is the last word
An inverted file is a list of positions by word!
a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (44)
word (14, 19, 24, 29, 35, 45)
words (7)
4562 (21, 27)
20-751 ECOMMERCE TECHNOLOGY
INVERTED FILE
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Inverted Files for Multiple Documents
LEXICON
WORD
jezebel
OCCUR
POS 1
POS 2
...
NDOCS PTR
20
jezer
3
jezerit
1
jeziah
1
jeziel
1
jezliah
1
jezoar
1
jezrahliah
1
jezreel
DOCID
39
jezoar
20-751 ECOMMERCE TECHNOLOGY
34
44
56
6
3
4
1
215
5
118
2291
22
2087
3010
134
566
3
203
245
287
67
1
132
4
6
1
3
322
15
481
42
FALL 2003
3922
3981
5002
1951
2192
992
WORD
INDEX
...
107
232
677
713
“jezebel” occurs
6 times in document 34,
3 times in document 44,
4 times in document 56 . . .
354
195
381
248
312
802
405
1897
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engine Architecture
• Spider
– Crawls the web to find pages. Follows hyperlinks.
Never stops
• Indexer
– Produces data structures for fast searching of all words in
the pages
• Retriever
– Query interface
– Database lookup to find hits
• 2 billion documents
• 4 TB RAM, many terabytes of disk
– Ranking
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Crawlers (Spiders, Bots)
• Retrieve web pages for indexing by search engines
• Start with an initial page P0. Find URLs on P0 and add them to a
queue
• When done with P0, pass it to an indexing program, get a page
P1 from the queue and repeat
• Can be specialized (e.g. only look for email addresses)
• Issues
– Which page to look at next? (Special subjects, recency)
– Avoid overloading a site
– How deep within a site to go (drill-down)?
– How frequently to visit pages?
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Query Specification
• Boolean
– AND , OR, NOT, PHRASE “ ”, NEAR ~
– But keyword query is artificial
• Question-answering (simulated)
– “Who offers a master’s degree in ecommerce?
• Date range
• Relevance specification
– In Altavista, can specify terms by importance (separate from
query specification)
• Content
– multimedia, MP3, .PPT files
• Stemming: eat, eats, eaten, eating, eater, (ate!)
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
“Advanced” Query Specification
• Multimedia, e.g. Google
• Date range
• Relevance specification
– In Altavista, can specify terms by importance (separate from
query specification)
• Content
– multimedia, MP3, .PPT files
• Stemming
• Language
• Search depth (from site’s front page)
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Ranking (Scoring) Hits
• Hits must be presented in some order
• What order?
– Relevance, recency, popularity, reliability?
• Some ranking methods
–
–
–
–
Presence of keywords in title of document
Closeness of keywords to start of document
Frequency of keyword in document
Link popularity (how many pages point to this one)
• Can the user control? Can the page owner control?
• Can you find out what order is used?
• Spamdexing: influencing retrieval ranking by altering
a web page. (Puts “spam” in the index)
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Google’s PageRank Algorithm
• Assumption: A link in page A to page B is a
recommendation of page B by the author of A
(we say B is successor of A)
The “quality” of a page is related to the number of
links that point to it (its in-degree)
• Apply recursively: Quality of a page is related to
– its in-degree, and to
– the quality of pages linking to it
PageRank Algorithm (Brinn & Page, 1998)
SOURCE: GOOGLE
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Definition of PageRank
• Consider the following infinite random walk (surfing):
– Initially the surfer is at a random page
– At each step, the surfer proceeds
• to a randomly chosen web page with probability d
• to a randomly chosen successor of the current page with
probability 1-d
• The PageRank of a page p is the fraction of steps
the surfer spends at p as the number of steps
approaches infinity
SOURCE: GOOGLE
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
PageRank Formula
d
PageRank ( p)   (1  d )  PageRank (q) / outdegree(q)
n
( q , p )E
where n is the total number of nodes in the graph
• Google uses d  0.85
• PageRank is a probability distribution over web
pages
• The sum of all PageRanks of all Pages is 1
SOURCE: GOOGLE
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
PageRank Example
B
A
d
d
P
PageRank of P is
(1-d)*[(PageRank of A)/4 + (PageRank of B)/3)] + d/n
PAGERANK CALCULATOR
SOURCE: GOOGLE
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Link Popularity
• How many pages link to this page?
– on the whole Web
– in our database?
• www.linkpopularity.com
• Link popularity is used for ranking
– Many measures
– Number of links in
– Weighted number of links in (by weight of referring page)
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engine Sizes (Sept. 2, 2003)
BILLIONS OF PAGES
ATW
AV
GG
INK
TMA
SEARCHES/DAY
(MILLIONS)
250
80
18
2900 per second!
20-751 ECOMMERCE TECHNOLOGY
AllTheWeb
Altavista
Google
Inktomi
Teoma
SOURCE: SEARCHENGINEWATCH.COM
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engine Usage
SHARE BY SEARCH SITE
SHARE BY ENGINE
SOURCE: SEARCHENGINEWATCH.COM
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engines Disjointness
Four searches, 10 engines, total of 141 hits on March 6, 2002
SOURCE: SEARCHENGINESHOWDOWN
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engine EKG
Shows activity of the Lycos crawler
at one sample site, calafia.com, by
number of pages visited during each crawl
20-751 ECOMMERCE TECHNOLOGY
SOURCE: SEARCHENGINEWATCH.COM
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
SOURCE: SEARCHENGINEWATCH.COM
Search Engine EKG Comparison
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engine Differences
• Coverage (number of documents)
• Spidering algorithms (visit SpiderCatcher)
– Frequency, depth of visits
•
•
•
•
Inexing policies
Search interfaces
Ranking
One solution: use a metasearcher (search agent)
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Metasearchers
• All the engines operate differently. Different
–
–
–
–
–
–
sizes
query languages
crawling algorithms
storage policies (stop words, punctuation, fonts)
freshness
ranking
• Submit the same query to many engines and collect
the results
• Metacrawler
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Clustering
• Viewing large numbers of unstructured hits is not
useful
• Answer: cluster them
• Vivisimo
• Kartoo
• iBoogie
• SurfWax
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Spying
•
•
•
•
•
•
•
•
•
Peeking at queries as they are being submitted
AllTheWeb
Metaspy. Spies on Metacrawler
AskJeeves
Epicurious (recipes)
StockCharts.com
Yahoo buzz index
Kanoodle
IQSeek
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Time Spent Per Visitor (minutes)
by Search Engine, Jan. 2003
Up 58% in ONE YEAR!
AJ
AOL
AV
ELNK
GG
ISP
LS
LY
MSN
NS
OVR
YH
Ask Jeeves
America Online
Altavista
EarthLink
Google
InfoSpace
LookSmart
Lycos
Microsoft
Netscape
OVERTURE
Yahoo
SOURCE: SEARCHENGINEWATCH.COM
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Audience Reach
by Search Site, Jan, 2003
AJ
AOL
AV
ELNK
GG
ISP
LS
LY
MSN
NS
OVR
YH
Audience Reach =
% of active surfers
visiting during month.
Totals exceed 100%
because of overlap
Ask Jeeves
America Online
Altavista
EarthLink
Google
InfoSpace
LookSmart
Lycos
Microsoft
Netscape
OVERTURE
Yahoo
SOURCE: SEARCHENGINEWATCH.COM
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Robot Exclusion
• You may not want certain pages indexed but still
viewable by browsers. Can’t protect directory.
• Some crawlers conform to the Robot Exclusion
Protocol. Compliance is voluntary. One way to
enforce: firewall
• They look for file robots.txt at highest directory level
in domain. If domain is www.ecom.cmu.edu,
robots.txt goes in www.ecom.cmu.edu/robots.txt
• A specific document can be shielded from a crawler
by adding the line: <META NAME="ROBOTS”
CONTENT="NOINDEX">
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Robots Exclusion Protocol
• Format of robots.txt
– Two fields. User-agent to specify a robot
– Disallow to tell the agent what to ignore
• To exclude all robots from a server:
User-agent: *
Disallow: /
• To exclude one robot from two directories:
User-agent: WebCrawler
Disallow: /news/
Disallow: /tmp/
• View the robots.txt specification.
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Key Takeaways
•
•
•
•
•
•
Engines are a critical Web resource
Very sophisticated, high technology
They don’t cover the Web completely
Spamdexing is a problem
New paradigms needed as Web grows
What about images, music, video?
– www.corbis.com, Google images
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Q&A
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Download