Nilay Khandelwal - Center for Software Engineering

The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
The Original Google Paper
Google is the common spelling of googol, or 10^100, which fit well with the authors’ goal of building very large-scale search engines.
Outline
 Design goals
 System features
 System anatomy
 Results and performance
 Paper analysis
Design Goals
1. Scale with the rapid growth of the web
[Chart: web growth, in thousands. Webpages indexed: 110 (1994), 100,000 (1997), 1,000,000 projected (2000). Queries/day: 2 (1994), 20,000 (1997), 100,000 projected (2000).]
Design Goals
2. Improved Search Quality
 The number of documents on the web is increasing rapidly, but users’ ability to look through them is not keeping pace.
 Current search engines return lots of “junk” results with little relevance. (Note: we’re talking about the year 1998.)
3. Academic Search Engine Research
 Push more search-engine development and understanding into the academic realm.
 Build systems that a reasonable number of people can actually use.
 Build an architecture that supports novel research activities on large-scale web data.
System Features
1. Makes use of the link structure of
the Web to calculate a quality
ranking for each page, called the
PageRank.
 A probability distribution used to
represent the likelihood that a person
randomly clicking on links will arrive at
any particular page.
 It considers the importance of each
page that casts a vote, as votes from
some pages are considered to have
greater value, thus giving the linked
page greater value.
PageRank: Bringing Order to the Web

PR(A) = (1 - d) + d \sum_{T_i \in L(A)} \frac{PR(T_i)}{C(T_i)}
 PR(A): PageRank of webpage A
 PR(Ti): PageRank of a webpage Ti pointing to A
 C(Ti): number of outbound links from webpage Ti
 L(A): set of webpages linking to A
 d: damping factor, a value between 0 and 1; the probability that a random surfer keeps clicking links (so 1 - d is the probability of stopping at any step)
 Note that PageRanks form a probability distribution over webpages, so the PageRanks of all webpages sum to 1 (in the normalized form, where the (1 - d) term is divided by the total number of pages).
PageRank: Bringing Order to the Web
 Assume a universe of 4 webpages: A, B, C, and D, where B, C, and D each link to A and have 2, 1, and 3 outbound links respectively. Ignoring damping:

PR(A) = \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(D)}{3}
 Taking into consideration that a random surfer eventually stops clicking, we add the damping factor d, generally set to 0.85:

PR(A) = (1 - d) + d \left[ \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(D)}{3} \right]
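To make the math concrete, here is a minimal Python sketch of PageRank as a power iteration over the four-page universe. The link graph is a hypothetical assumption: B, C, and D link to A with 2, 1, and 3 outbound links respectively, matching the example above, and A’s own outlinks are invented so the graph is complete.

```python
# Minimal PageRank power-iteration sketch; not the paper's implementation.
links = {
    "A": ["B", "C"],        # assumed, so that A has outlinks of its own
    "B": ["A", "C"],        # C(B) = 2
    "C": ["A"],             # C(C) = 1
    "D": ["A", "B", "C"],   # C(D) = 3
}

d = 0.85                               # damping factor from the slides
pr = {page: 1.0 for page in links}     # paper's form: values average 1

for _ in range(50):                    # iterate to (approximate) convergence
    pr = {
        page: (1 - d) + d * sum(
            pr[t] / len(out) for t, out in links.items() if page in out
        )
        for page in links
    }

print(pr)   # A, the most linked-to page, ends up with the highest PageRank
```

Note that this uses the paper’s unnormalized form; dividing the (1 - d) term by the number of pages yields the version whose values sum to 1, as described above.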
System Features
2. Makes use of the anchor text of links on webpages:
 E.g. <a href="http://www.yahoo.com">Yahoo!</a>
 The text of a link is associated not only with the webpage it appears on; it also gives information (often more relevant) about the webpage it points to.
 Anchors may exist for documents that generally cannot be indexed by text-based search engines, such as images, programs, and databases.
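The idea fits in a few lines of Python. This is an illustrative sketch with hypothetical names, not Google’s data structures: each link’s anchor text is indexed under the target URL rather than the page the link appears on.

```python
from collections import defaultdict

anchor_index = defaultdict(list)   # target URL -> anchor texts describing it

def record_link(source_url, target_url, anchor_text):
    # The anchor text describes the *target*, so it is indexed under
    # target_url; this also makes un-crawlable targets (images, programs,
    # databases) findable by the words other pages use to describe them.
    anchor_index[target_url].append(anchor_text)

# While parsing some page we encounter:
#   <a href="http://www.yahoo.com">Yahoo!</a>
record_link("http://example.com/page.html", "http://www.yahoo.com", "Yahoo!")
print(anchor_index["http://www.yahoo.com"])   # ['Yahoo!']
```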
System Features
3. Uses location information for all hits and thus
makes extensive use of proximity in search.
4. Keeps track of visual presentation of text on
webpages such as font sizes. Words with
bolder/larger font are given more importance.
5. Stores the complete raw HTML of webpages in a repository.
System Anatomy
Major Data Structures
1. BigFiles
 Virtual files spanning multiple file systems, addressable by 64-bit integers.
2. Repository
 Contains the full compressed HTML of all pages.
 Documents are stored one after another, each prefixed with its docID, length, and URL.
 Compressed with a high-speed technique (zlib) rather than a high-compression-ratio one (bzip), trading disk space for speed.
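As an illustration of the repository layout, here is a hedged Python sketch. The slides only say each document is prefixed with docID, length, and URL, so the exact field widths below are assumptions.

```python
import struct
import zlib

def append_document(repo, doc_id, url, html):
    compressed = zlib.compress(html.encode("utf-8"))   # zlib: speed over ratio
    url_bytes = url.encode("utf-8")
    # Assumed header: docID, compressed length, URL length as 32-bit ints.
    repo.write(struct.pack("<III", doc_id, len(compressed), len(url_bytes)))
    repo.write(url_bytes)
    repo.write(compressed)

def read_documents(repo):
    # Records sit one after another, so reading is a simple forward scan.
    while header := repo.read(12):
        doc_id, clen, ulen = struct.unpack("<III", header)
        url = repo.read(ulen).decode("utf-8")
        html = zlib.decompress(repo.read(clen)).decode("utf-8")
        yield doc_id, url, html

with open("repository.dat", "wb") as repo:
    append_document(repo, 1, "http://www.yahoo.com", "<html>...</html>")
```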
Major Data Structures
3. Document Index
 Keeps information about each document.
 A fixed-width index, ordered by docID.
 Stores the document status, a pointer into the repository, and a checksum.
 If the document has been crawled, the entry points to a variable-width file, docinfo, containing its URL and title; otherwise it points to the URLlist, which contains only the URL.
4. Lexicon
 Contains a list of null-separated words (about 14 million) and a hash table of pointers.
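The payoff of a fixed-width, docID-ordered index is that any entry can be fetched with one computed seek. A sketch under assumed field sizes (the paper does not give an exact byte layout):

```python
import struct

# Assumed record: docID (u32), repository offset (u64), checksum (u32),
# status byte. Fixed width means entry N lives at byte N * RECORD.size.
RECORD = struct.Struct("<IQIB")

def write_entry(index_file, doc_id, repo_offset, checksum, status):
    index_file.seek(doc_id * RECORD.size)
    index_file.write(RECORD.pack(doc_id, repo_offset, checksum, status))

def read_entry(index_file, doc_id):
    index_file.seek(doc_id * RECORD.size)   # O(1) lookup by docID
    return RECORD.unpack(index_file.read(RECORD.size))

# Lexicon sketch: a null-separated word list plus a hash table of pointers.
words = ["google", "search", "web"]
lexicon_blob = b"\0".join(w.encode("utf-8") for w in words)
word_to_id = {w: i for i, w in enumerate(words)}
```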
Major Data Structures
5. Hit Lists
 A list of occurrences of a particular word in a
particular document including position, font, and
capitalization information.
 Hit lists account for most of the space used in both
the forward and the inverted indices.
6. Forward Index
 Stored in a number of barrels.
 If a document contains words that fall into a particular barrel, its docID is recorded in the barrel, followed by a list of wordIDs with their hit lists.
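The paper packs each “plain” hit into two bytes: one capitalization bit, three bits of (relative) font size, and twelve bits of word position. A small sketch of that packing, which is why hit lists stay compact despite dominating the index:

```python
def pack_hit(capitalized, font_size, position):
    # 1 bit capitalization | 3 bits font size | 12 bits word position
    assert font_size < 8 and position < 4096
    return (capitalized << 15) | (font_size << 12) | position

def unpack_hit(hit):
    return hit >> 15, (hit >> 12) & 0b111, hit & 0xFFF

hit = pack_hit(capitalized=1, font_size=3, position=42)
print(unpack_hit(hit))   # (1, 3, 42) -- two bytes per occurrence
```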
Major Data Structures
7. Inverted Index
 The inverted index consists of the same barrels as the
forward index, except that they have been
processed by the sorter.
Crawling the Web
1. Several distributed crawlers (see the sketch after this list).
 A URLserver serves lists of URLs to the crawlers.
 Each crawler keeps ~300 connections open at once.
 At peak, a system of 4 crawlers can crawl ~100 pages/sec, roughly 600 KB of data per second.
 Each crawler maintains its own DNS cache for fast lookups.
2. The parser handles a huge array of possible errors, including typos in HTML, non-ASCII characters, and HTML tags nested hundreds of levels deep.
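A toy sketch of two of these ideas, many fetches in flight and a per-crawler DNS cache, using only Python’s standard library. The real crawler’s event-driven I/O and its URLserver protocol are not reproduced here.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from urllib.parse import urlparse
from urllib.request import urlopen

@lru_cache(maxsize=None)
def resolve(host):
    # Per-crawler DNS cache: each hostname is resolved at most once.
    return socket.gethostbyname(host)

def fetch(url):
    resolve(urlparse(url).hostname)   # warm the cache (illustrative only;
                                      # urlopen performs its own lookup here)
    with urlopen(url, timeout=10) as resp:
        return url, resp.read()

urls = ["http://example.com/"]        # in Google, supplied by the URLserver
with ThreadPoolExecutor(max_workers=300) as pool:   # ~300 fetches in flight
    for url, body in pool.map(fetch, urls):
        print(url, len(body))
```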
Indexing the Web
3. Indexing Documents into Barrels
 After each document is parsed, it is encoded into a number of barrels.
 Every word is converted into a wordID using an in-memory hash table, the lexicon.
 Once words are converted into wordIDs, their occurrences in the current document are translated into hit lists and written into the forward barrels.
4. Sorting
 The sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits, and a full-text inverted barrel. (Both steps are sketched below.)
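Both steps fit in a short sketch, with two simplifying assumptions: hits are bare word positions, and barrel assignment uses wordID modulo the barrel count rather than wordID ranges.

```python
from collections import defaultdict

lexicon = {}                            # word -> wordID (in-memory hash table)
NUM_BARRELS = 4
forward_barrels = [[] for _ in range(NUM_BARRELS)]

def word_id(word):
    # Assign the next free wordID on first sight of a word.
    return lexicon.setdefault(word, len(lexicon))

def index_document(doc_id, words):
    hits = defaultdict(list)
    for position, word in enumerate(words):
        hits[word_id(word)].append(position)   # simplified hit: position only
    for wid, hitlist in hits.items():
        forward_barrels[wid % NUM_BARRELS].append((doc_id, wid, hitlist))

index_document(1, ["google", "search", "engine", "google"])

# The sorter re-sorts each forward barrel (which arrives in docID order)
# by wordID, turning it into an inverted barrel: word -> its doclist.
inverted_barrels = [sorted(b, key=lambda rec: rec[1]) for b in forward_barrels]
```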
Searching
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
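A simplified sketch of steps 3 through 8: intersect the doclists (one per query word), rank the matches, and return the top k. The short-barrel/full-barrel fallback of steps 6 and 7 is omitted, and `rank` stands in for Google’s actual scoring.

```python
def search(query_word_ids, inverted_index, rank, k=10):
    # inverted_index: wordID -> sorted list of docIDs (the doclist).
    doclists = [inverted_index[w] for w in query_word_ids]
    # Documents that appear in every doclist match all search terms (step 4).
    matches = set(doclists[0]).intersection(*doclists[1:])
    # Rank each match for this query (step 5), then sort and truncate (step 8).
    ranked = sorted(matches, key=lambda d: rank(d, query_word_ids), reverse=True)
    return ranked[:k]
```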
Results and Performance
 A qualitative analysis of the search results by
users has generally been positive.
 The current version of Google answers most
queries in between 1 and 10 seconds.
 Because Google takes the proximity of word occurrences into account, its results are more relevant than those of search engines that simply return every document containing all the query words. (E.g., a search for “bill clinton” gives lower importance to pages where “bill” and “clinton” occur independently.)
Future Work
 Search times in the current version of Google are dominated by disk I/O; introduce query caching plus hardware, software, and algorithmic optimizations.
 Improve search efficiency and scale quickly to ~100 million web pages.
 Develop Google as a large-scale research tool for searchers and researchers.
Analysis of the Research Paper
 Pros
 One of the first descriptions of the PageRank algorithm, which changed how search engines ranked and indexed the web.
 Using the citation (link) graph and anchor text to rank pages closely mirrors how users themselves judge websites.
 Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.
 The paper states that Google does not compromise PageRank for monetary gain, lending more credibility to its search results. This holds true to date.
Analysis of the Research Paper
 Cons
 One of the first flaws found in PageRank-based ranking was the “Google bomb”:
 A page is ranked highly for a term when many of the sites linking to it use that term as consistent anchor text.
 A Google bomb is created when a large number of sites deliberately link to a page in this manner.
 Ranking quality is insufficient using only PageRank and anchor text. (Google today uses more than 200 different signals to judge the quality of a webpage.)
Thank You
Presented by: Nilay Khandelwal