InternetSearch05 - BNRG - University of California, Berkeley

Needle in the Haystack:
The Technology of Internet Search
Randy H. Katz
The United Microelectronics Corporation Distinguished Professor
Computer Science Division, EECS Department
University of California, Berkeley
Berkeley, CA 94720-1776 USA
randy@cs.Berkeley.edu
Outline
• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google’s Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions
Search is BIG!
And the World is Going Digital
Historical Background: The Perfect Storm
• ARPANet, 1969
• NSFNet, 1985
• Commercial Internet, 1995
• Vannevar Bush, “As We May Think” (MEMEX), 1945
• Ted Nelson, Xanadu Hypertext, 1965-1990 (Autodesk)
• SGML, 1986
• Bill Atkinson, HyperCard, 1987
• Tim Berners-Lee, URL/HTTP/HTML, 1989
• Marc Andreessen, NCSA Mosaic, 1993
• Jim Clark, Netscape
• World Wide Web, 1995
• Est. $15.5 billion spent online, Thanksgiving to Christmas 2004, up 28% since 2003
Information Tsunami
• Bit: Binary digit – either a 0 or 1
• Byte: 8 bits
– 1 byte: single character
– 10 bytes: a single word
– 100 bytes: Telegram or punched card
• Kilobyte: 1,000 or 10^3 bytes
– 1 kilobyte: Very short story
– 2 kilobytes: Typewritten page
– 10 kilobytes: Encyclopedia page
– 50 kilobytes: Compressed document image page
– 100 kilobytes: Low-res photo
– 200 kilobytes: Box of punched cards
http://www.sims.berkeley.edu/research/projects/how-much-info/index.html
• Megabyte: 1,000,000 or 10^6 bytes
– 1 megabyte: Small novel or 3.5in floppy disk
– 2 megabytes: Hi-res photo
– 5 megabytes: Complete works of Shakespeare
– 10 megabytes: Minute of hi-fi sound
– 100 megabytes: 1 meter of shelved books
– 500 megabytes: CD-ROM
• Gigabyte: 1,000,000,000 or 10^9 bytes
– 1 gigabyte: Pickup truck filled with paper
– 2 gigabytes: Movie on a DVD
– 50 gigabytes: Floor of books
– 100 gigabytes: Floor of academic journals
– 500 gigabytes: Biggest FTP site
• Terabyte: 1,000,000,000,000 or 10^12 bytes
– 1 terabyte: 50,000 trees made into paper and printed, or 1 day of EOS data
– 2 terabytes: Academic research library
– 10 terabytes: Printed collection of the U.S. Library of Congress
– 50 terabytes: Contents of a large mass storage system
– 400 terabytes: National Climate Data Center (NOAA) database
• Petabyte: 1,000,000,000,000,000 or 10^15 bytes
– 1 petabyte: 3 years of Earth Observing System (EOS) data
– 2 petabytes: All U.S. academic research libraries
– 8 petabytes: All information available on the Web
– 200 petabytes: All printed material (2001)
• Exabyte: 1,000,000,000,000,000,000 or 10^18 bytes
– 2 exabytes: Total volume of information generated worldwide annually
– 5 exabytes: All words ever spoken by humans
• Zettabyte: 1,000,000,000,000,000,000,000 or 10^21 bytes
• Yottabyte: 1,000,000,000,000,000,000,000,000 or 10^24 bytes
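The scale of units above can be made concrete with a small conversion helper. The following is a minimal sketch, assuming the decimal (powers of 1,000) definitions used on these slides; the function name `describe_size` is illustrative and not from the original talk.

```python
# Decimal (SI) unit names, matching the powers of ten used on the slides.
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes", "terabytes",
         "petabytes", "exabytes", "zettabytes", "yottabytes"]

def describe_size(num_bytes):
    """Return a human-readable size using powers of 1,000 (not 1,024)."""
    value = float(num_bytes)
    index = 0
    # Divide by 1,000 until the value fits the current unit.
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {UNITS[index]}"

print(describe_size(5 * 10**6))    # slide example: complete works of Shakespeare
print(describe_size(8 * 10**15))   # slide example: all information on the Web
```

Binary units (powers of 1,024) would give slightly different figures; the decimal convention is used here only because the slides use it.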
Anatomy of a Web Page: Randy’s Home Page
• URL: Uniform Resource Locator
• Images
• Text
<html>
<head>
<title>Professor Randy Howard Katz University of California Berkeley
Computer Science Division Home Page</title>
<meta name="description" content="Home Page of Berkeley Computer Science
Professor Randy Howard Katz">
<meta name="keywords" content="Katz Randy Howard Berkeley Professor
University California Electrical Engineering Computer Science
Department RAID Redundant Arrays Inexpensive Disks SPUR Snoop Wireless
Communications Networks Programmable Network Elements">
</head>
<body>
<p><img height="269" src="Randy_2004.jpg" width="182" align="bottom"
naturalsizeflag="0">   <img height="269" src="RHK85a.jpg"
width="177" align="bottom" naturalsizeflag="0">   </p>
<p><font size="-1">2005 vs. 1985 ... The hair is grayer, but the smirk
remains the same!<br>
<br>
"... Katz, a thin, almost gaunt man with horn-rimmed glasses magnifying
sunken eyes. ..."<br>
--George Johnson, WIRED Magazine, (January 2000), page 150.</font></p>
<p><img src="VISIONAR.JPG" align="bottom"> </p>
• Text
• Images
• Links!
Anatomy of a Web Page:
Randy’s Web Page
<hr align="left">
<h1>Professor Randy H. Katz</h1>
<h3>Electrical Engineering and Computer Science
Department</h3>
<p><a href="http://www.umc.com.tw/"><img hspace="6"
src="UMCLogo.gif" align="left"> </a>
<b><font size="+1">The
<a href="http://www.umc.com.tw/">United Microelectronics
Corporation</a> Distinguished Professor</font></b></p>
<p><font size="-1"><br clear="left">
Ph.D., University of California, Berkeley, 1980.<br>
M.S., University of California, Berkeley, 1978.<br>
A.B., Cornell University, 1976.<br>
</font></p>
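The pieces a search engine cares about in markup like this (title, keywords, images, links) can be pulled out mechanically. The following is a minimal sketch using Python's standard `html.parser` module; the class name `PageAnatomy` and the shortened `SAMPLE` page are illustrative assumptions, not the actual home page or any search engine's real parser.

```python
from html.parser import HTMLParser

class PageAnatomy(HTMLParser):
    """Collect the title, meta keywords, link targets and image sources."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "keywords":
            self.keywords = (attrs.get("content") or "").split()
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears between <title> and </title> is kept.
        if self._in_title:
            self.title += data

# A shortened, hypothetical page in the style of the excerpt above.
SAMPLE = """<html><head>
<title>Professor Randy Howard Katz</title>
<meta name="keywords" content="Katz Berkeley RAID Wireless">
</head><body>
<p><img src="Randy_2004.jpg"></p>
<p><a href="http://www.umc.com.tw/"><img src="UMCLogo.gif"></a></p>
</body></html>"""

page = PageAnatomy()
page.feed(SAMPLE)
print(page.title, page.keywords, page.links, page.images)
```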
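Anatomy of Web Access
[Figure: a Web browser shows a Web page in HTML containing the link URL http://www.umc.com.tw/. In steps (1)-(2), the Naming System (DNS) performs name-to-address mapping and returns an IP address; in steps (3)-(4), the browser contacts the Web server in Taiwan.]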
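The access sequence can be sketched in a few lines of code. This is a minimal sketch, assuming the numbered arrows in the figure correspond to a DNS lookup followed by an HTTP request; the function name `plan_web_access` and the stand-in resolver are hypothetical, and the address 192.0.2.1 is a documentation placeholder rather than the real server address.

```python
import socket
from urllib.parse import urlsplit

def plan_web_access(url, resolve=socket.gethostbyname):
    """Resolve the host in `url` and build the HTTP request to send to it.

    `resolve` maps a host name to an IP address; by default it asks the
    real DNS, but a stand-in can be passed for offline experiments.
    """
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80            # default HTTP port
    path = parts.path or "/"
    ip_address = resolve(host)         # name-to-address mapping
    request = (f"GET {path} HTTP/1.1\r\n"
               f"Host: {host}\r\n"
               "Connection: close\r\n\r\n")
    return ip_address, port, request

# Offline demonstration with a stand-in resolver (placeholder address).
fake_dns = {"www.umc.com.tw": "192.0.2.1"}
ip, port, request = plan_web_access("http://www.umc.com.tw/",
                                    resolve=fake_dns.__getitem__)
print(ip, port)
print(request)
```

A real browser would then open a TCP connection to the returned address and port, send the request, and render the HTML that comes back.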
Anatomy of Web Access: Content Caching
[Figure: the Naming System (DNS) knows the origin IP, while the content network DNS returns an edge cache IP. In steps (5)-(8), the Web browser follows the link URL …/English/about/index.asp and is served by a content distribution edge cache in San Jose instead of the origin Web server in Taiwan.]
Challenges of Search
• How to find all the pages on the Web?
• How to order the pages by relevance?
• How to make searchable the content on those pages?
• How to keep it all up-to-date?
• Web Crawlers/SpiderBots
– Network software executing in parallel that follows links in the Web to find content
– Web pages “scraped” for more links to follow
– Web revisited on the order of once every two to three days
• Indexers
– Web pages “scraped” for search terms to build indexes
– (Google) Page rank algorithm: order a page within the index based (roughly) on how many pages refer to it
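The crawling loop described above can be sketched as a breadth-first traversal of links. This is a minimal, single-threaded sketch (production crawlers run many such loops in parallel); the `crawl` function, the `fetch_links` callback, and the `TOY_WEB` link table with its example host names are all hypothetical stand-ins for real page fetching and scraping.

```python
from collections import deque

def crawl(seed, fetch_links, max_pages=100):
    """Visit pages breadth-first, following each newly discovered link once."""
    seen = {seed}
    frontier = deque([seed])
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        # "Scrape" the page for more links to follow.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# A tiny in-memory "Web": each page maps to the links it contains.
TOY_WEB = {
    "a.example/": ["b.example/", "c.example/"],
    "b.example/": ["c.example/"],
    "c.example/": ["a.example/", "d.example/"],
    "d.example/": [],
}

print(crawl("a.example/", lambda url: TOY_WEB.get(url, [])))
```

The `seen` set is what keeps the crawler from looping forever on cycles such as a.example/ to c.example/ and back.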
Quick (and Incomplete) History of Search Engines
[Timeline figure; axis marks: Pre-Web, 1993, 1995, 1997, 1999, 2001, 2003, 2005]
• UMinn: Veronica & Archie services for gopher & ftp (Pre-Web)
• MIT: Wandex/WWW Wanderer; Aliweb
• CMU: Lycos; 1st commercial search engine
• Stanford: Yahoo! directories
• Battle for popularity: Webcrawler (UWash), HotBot (Wired), Excite (Stanford), Infoseek (ABC), Inktomi (Berkeley), AltaVista (DEC), Google (Stanford)
• Yahoo! acquires Inktomi; Yahoo! acquires Overture (AlltheWeb, AltaVista); Yahoo! deploys joint technology
• Also shown: a9.com, AlltheWeb, Ask Jeeves, Clusty, Gigablast, Ez2Find, Teoma, WiseNut, GoHook, Walhello, Kartoo
Search Challenges and Issues
• Web growing faster than search engines can index
• Web pages updated frequently, forcing frequent revisits
• Keyword-only searches result in many false positives
• Difficult to index dynamically generated sites: the so-called “invisible web”
• Some search engines order results by financial “placement” considerations rather than relevance
• Some sites trick search engines into displaying them first for some keywords, which pollutes search results and pushes more relevant links further down
Page Ranking Algorithms
• Web page relevancy
– Many hits: how to ensure the best/most relevant web pages are presented first in answer to a search
• Location and Frequency of Keywords
– Index terms in page title raise its relevance for that term
– Keywords near “top” of page more relevant than bottom
– High keyword frequency boosts relevance
• If search engine strategy is known, page developers will “game” the strategy to get their pages ranked higher
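The location-and-frequency heuristics can be combined into a toy scoring function. This is a minimal sketch, not any real engine's formula: the function name `keyword_score`, the title weight of 3.0, and the way the three signals are added together are all arbitrary assumptions chosen for illustration.

```python
def keyword_score(term, title, body, title_weight=3.0):
    """Score a page for one search term using title, position and frequency."""
    term = term.lower()
    score = 0.0
    # Signal 1: the term appears in the page title.
    if term in title.lower().split():
        score += title_weight
    words = body.lower().split()
    if term in words:
        first = words.index(term)
        # Signal 2: earlier first occurrence scores closer to 1.
        score += 1.0 - first / len(words)
        # Signal 3: fraction of the body's words that match the term.
        score += words.count(term) / len(words)
    return score

print(keyword_score("raid", "RAID disks", "raid is redundant arrays"))
print(keyword_score("raid", "Home", "redundant arrays called raid"))
```

A page author who knows these rules can inflate the score simply by repeating the term in the title and at the top of the body, which is exactly the "gaming" problem noted in the last bullet.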
Google’s Page Rank Algorithm
• Which is the most important page?
Google’s Page Rank Algorithm
• Googlese from their web page:
– PageRank relies on the uniquely democratic
nature of the web by using its vast link
structure as an indicator of an individual
page's value. Google interprets a link from
page A to page B as a vote, by page A, for
page B. But, Google looks at more than the
sheer volume of votes, or links a page
receives; it also analyzes the page that
casts the vote. Votes cast by pages that
are themselves "important" weigh more
heavily and help to make other pages
"important."
Google Page Rank Algorithm
• Basic idea:
– Page’s rank determined by the number of links to the page (also known as citations)
– If citing page is more important (has a high page rank/authority page) then the pages it cites are more important
– If citing page has many links, then each cited page gets a smaller share of its vote (normalize for number of links on citing page)
• PR(P) is the page rank of page P; T1, …, Tn are the pages that cite P; C(T) is the # of links out of page T; d is a “decay factor” (called the damping factor in the original paper), e.g., 0.85. Then:
PR(P) = (1 – d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
• See http://www-db.stanford.edu/~backrub/google.html
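The formula is circular (each page's rank depends on the ranks of the pages citing it), so it is usually solved by repeated substitution. The following is a minimal sketch of that iteration on a hand-made four-link graph; the function name `pagerank`, the starting value of 1.0 per page, and the fixed iteration count are assumptions for illustration, not Google's production implementation.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(P) = (1 - d) + d * sum(PR(T)/C(T)) over pages T citing P.

    `links` maps each page to the pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    # Outgoing links per page, with duplicates removed.
    out = {p: set(links.get(p, ())) for p in pages}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each citing page t splits its rank evenly across its C(t) links.
            incoming = sum(rank[t] / len(out[t]) for t in pages if p in out[t])
            new_rank[p] = (1 - d) + d * incoming
        rank = new_rank
    return rank

# Toy graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(value, 3))
```

In this toy graph C ends up ranked highest because it is cited by both other pages, and A benefits from being the only page C cites. Pages with no outgoing links are not given special treatment in this sketch.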
Google Conceptual Architecture
Google Server Architecture
[Figure: the Google Web Server connected to a Spell Checker, an Ad Server, an Index Server, and many Doc Servers]
• Index servers: search term partitioned and mapped to doc list
• Intersect to find document list, sort by page rank
• Document IDs used to extract text from Doc Servers
• Over 100,000 processors (and growing) in Googleplex
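The index-server steps (map each term to a document list, intersect the lists, sort by page rank) can be sketched on a single machine. This is a minimal sketch with a hypothetical three-document collection and made-up rank values; it ignores the partitioning across many servers that the slide describes.

```python
def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(query, index, page_rank):
    """Return IDs of documents containing every query term, best rank first."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return []
    matches = set.intersection(*postings)
    return sorted(matches, key=lambda doc_id: page_rank.get(doc_id, 0.0),
                  reverse=True)

# Hypothetical documents and page ranks, for illustration only.
DOCS = {
    1: "berkeley computer science",
    2: "berkeley wireless networks",
    3: "computer networks berkeley",
}
RANKS = {1: 0.5, 2: 1.2, 3: 0.9}

index = build_index(DOCS)
print(search("berkeley networks", index, RANKS))
print(search("berkeley computer", index, RANKS))
```

The returned document IDs are what a front end would then hand to the doc servers to fetch titles and text snippets.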
Fun and Games
• Google Scholar
• Googling Someone
• Google News
• Comparison Shopping
• Google Whacks
Google Scholar
Google Randy
Google Randy Katz: “Google Index” and Advertising Placement
Google News
Comparison Shopping
elgooG
Google Whacks
Business Model
Ad Placement and Click-Thru
Old data (2002): Google is now market leader in ad revenue
2004 revenue through 9/30/04: $2.1B
Top 10 Search Engines
10. DMOZ.org
9. Alltheweb.com
8. KartOO.com
7. MSN.com
6. Dogpile.com
5. AskJeeves.com
4. About.com
3. Yahoo.com
2. Vivisimo.com
1. Google.com
Clustering
Google Video Search
Amazon’s A9
A9’s Yellow Pages
Innovations Now and Yet to Come
• Index ever larger portions of the Web, even beyond
traditional web pages, e.g., video
• Better quality/higher relevance searches
• Better presentation of results, e.g., clustering, site
information
• Better exploitation of semantic relationships for
improved page ranking, more personalization, e.g.,
user’s zip code
• More services (Web, news groups, blogs, comparison
shopping, video/audio, yellow pages, etc.)
• Integrate with desktop machine
Parting Thoughts
“Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?”
T.S. Eliot, “Choruses from ‘The Rock’”, Selected Poems, NY:
Harvest / Harcourt, 1962, p. 107.
Needle in the Haystack:
The Technology of Internet Search
Thanks for Your Patience & Attention!
Questions?