Discussion Report
The Anatomy of a Large-Scale Hypertextual Web Search Engine
By Jason Wang
Every day millions more people use the web and thousands of new web pages sprout up, which
makes finding the right information harder and harder. Developers at Google have developed a
way to sort through all of that information.
Design Goals
There are a few things that everyone desires from a search engine: the speed of the results, the
quantity of the results, and the relevance of the results. After all, we do not want a million
results that have nothing to do with what we are searching for.
Google System Features
Google, a second-generation search engine, brings a couple of radical new ideas to the table:
PageRank and anchor text. PageRank is the idea that every page on the web votes for other pages
by linking to them. One can imagine a random surfer clicking from link to link; a page's
PageRank is the probability that the surfer is on that page at any given time. A page with many
links pointing to it will therefore tend to have a high PageRank. With this in mind, spammers
attempt to create an artificially high PageRank by creating a bunch of spam pages that link to
their desired web page. PageRank addresses this problem by weighting the votes: a link from a
page that itself has a high PageRank is worth more than a link from a page that does not. The
formula is PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where T1 … Tn are the pages that
link to page A, C(T) is the number of links going out of page T, and d is a damping factor
between 0 and 1 (the paper usually sets it to 0.85).
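
The formula can be evaluated by starting every page at the same value and applying it
repeatedly until the numbers settle. The sketch below is a minimal illustration, not Google's
implementation: the function name pagerank, the fixed count of 50 iterations, the starting value
of 1.0, and the three-page toy web are all assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Repeatedly apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)).

    `links` maps each page to the pages it links to (hypothetical input format).
    """
    # Use sets so a page that links to the same target twice only votes once.
    outgoing = {page: set(targets) for page, targets in links.items()}
    pages = set(outgoing)
    for targets in outgoing.values():
        pages.update(targets)

    ranks = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Each page T that links here contributes PR(T) / C(T).
            incoming = sum(
                ranks[source] / len(targets)
                for source, targets in outgoing.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks


# Toy web: A and B both link to C, and C links back to A.
toy_web = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy web nothing links to B, so B ends up with the minimum value of 1 - d, while C, which
collects two votes, ranks highest.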
Another system feature that Google can boast is its intelligent use of anchor text. Anchor text
is the clickable text of a link. It is written by the author of the linking page, and Google
associates it not only with the page the link appears on but also with the page the link points
to. The advantage of using this text is that Google is able to bring you relevant songs,
pictures, and even videos, which have no text of their own to index, not to mention enhanced
search accuracy. Another advantage is that pages that have not been crawled can still come up
as search results. Spammers try to take advantage of anchor text by filling it with random
things that have no relevance to the picture or page being linked. Google counters this by
checking that the anchor text also appears somewhere on the target page, or that it matches the
anchor text of other links pointing to that page.
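
One way to picture this is as an index keyed by the page a link points to rather than the page
it appears on. The sketch below assumes links arrive as (source, target, anchor text) triples;
the function names and the example.com URLs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Group anchor text under the page each link points TO."""
    index = defaultdict(list)
    for source_url, target_url, anchor_text in links:
        index[target_url].append(anchor_text.lower())
    return index


def search_anchors(index, word):
    """Return the target URLs whose incoming anchor text contains `word`."""
    word = word.lower()
    return sorted(
        url
        for url, texts in index.items()
        if any(word in text.split() for text in texts)
    )


# Hypothetical links: the picture has no text of its own and was never crawled.
example_links = [
    ("http://example.com/blog", "http://example.com/cat.jpg", "Sleeping cat photo"),
    ("http://example.com/pets", "http://example.com/cat.jpg", "cat picture"),
    ("http://example.com/blog", "http://example.com/news", "Daily news"),
]
anchor_index = build_anchor_index(example_links)
cat_results = search_anchors(anchor_index, "cat")
```

Here the picture is found for the word "cat" purely from what other pages say about it.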
Apart from PageRank and anchor text, Google also uses smaller signals such as the proximity of
words, font size, color, and position on the page. For example, when two words that match the
search appear close to each other on a page, that page is more likely to be relevant than one
where they are far apart. Text in a larger font counts for more than text in a smaller font, and
text that is in a different color or comes early in the page is more likely to be important than
plain black-and-white text or text that comes later in the page. Spammers have attempted to take
advantage of this by making everything a very large font, but Google has battled back by
measuring font size relative to the rest of the text on the same page.
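
The proximity idea can be illustrated by measuring how far apart two query words are in a page.
This is only a sketch of the intuition: the function name min_distance and the two sample
sentences are made up, and the report does not say how Google turns the distance into a score.

```python
def min_distance(words, first, second):
    """Smallest gap, in word positions, between `first` and `second`.

    Returns None when either word is missing from the page.
    """
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    if not first_positions or not second_positions:
        return None
    return min(abs(a - b) for a in first_positions for b in second_positions)


# Two hypothetical pages for the query "search engine".
near_page = "the search engine ranks pages".split()
far_page = "search results can be slow when the engine is busy".split()

near_gap = min_distance(near_page, "search", "engine")
far_gap = min_distance(far_page, "search", "engine")
```

A smaller gap would suggest the more relevant page, so near_page would be preferred over
far_page for this query.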
The System Architecture
Have you wondered what goes on behind the scenes when you search? Every time a user types a
query into the empty white box, they end up with their desired results in a matter of
milliseconds. What happens behind the empty box all begins with the URL Server. You can go to
Google yourself and request that your own URL be crawled, and your URL will simply go to the URL
Server. This server tells the Crawlers where to crawl. Google normally runs about 3 Crawlers at
one time; together they can crawl roughly 100 pages per second, and each keeps roughly 300
connections open at once. The pages the Crawlers fetch are given to the Store Server, where all
of the data is compressed. This compressed data is then stored in the Repository. The Indexer
takes the files from the Repository and sorts them into the Barrels. There are 2 kinds of
Barrels. The Indexer creates the Forward Barrels, and later the Sorter takes the information in
the forward index and re-sorts it to create the inverted index. This way, when the user
searches, the engine can go from a word to the websites that contain it rather than from a
website to its words. The Lexicon sits right next to the Barrels. It can be viewed as the place
where only the words themselves are stored: a list of every distinct word found in all of the
information crawled. While indexing, the Indexer also pulls out all of the anchor text and links
and sends them on to the components that compute PageRank, while the Document Index holds a
brief summary of every page, which is also shown to the searcher.
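
The step from forward index to inverted index can be sketched in a few lines. The dictionaries
below are stand-ins for the Barrels and the Lexicon, and the IDs, words, and function names are
invented for the example; the real structures are compressed files rather than Python objects.

```python
def invert(forward_index):
    """Turn {doc_id: [word_id, ...]} into {word_id: [doc_id, ...]}."""
    inverted = {}
    for doc_id, word_ids in forward_index.items():
        for word_id in set(word_ids):  # list each document once per word
            inverted.setdefault(word_id, []).append(doc_id)
    return {word_id: sorted(doc_ids) for word_id, doc_ids in inverted.items()}


def lookup(word, lexicon, inverted_index):
    """Answer a one-word query: word -> word ID -> document IDs."""
    word_id = lexicon.get(word)
    if word_id is None:
        return []
    return inverted_index.get(word_id, [])


# Hypothetical data: three documents and a three-word lexicon.
lexicon = {"web": 0, "search": 1, "engine": 2}
forward_index = {10: [0, 1], 11: [1, 2], 12: [1]}

inverted_index = invert(forward_index)
search_docs = lookup("search", lexicon, inverted_index)
```

With the inverted index in place, a query never has to scan documents one by one; it only has to
find the word in the lexicon and read off the matching document IDs.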
The Storage
The total size of the Forward Barrels is approximately 43GB, while the Inverted Barrels are
about 41GB. The Lexicon is rather small by comparison at 293MB. Inside all of this space is a
compact format Google developed to store data. Every occurrence of a word on a web page is
called a “hit,” and Google compresses each hit down to 2 bytes. These 2 bytes record whether
the hit is plain, fancy, or anchor. Fancy hits are words that occur in a URL, title, or meta
tag, anchor hits are words in anchor text, and plain hits are everything else. The 2 bytes also
record capitalization, the font size relative to the rest of the page, and where the word is
located on the page, stored as a numerical position. Within the Forward Barrels, every page has
a number called the document ID. Under each document ID there is a list of word IDs, each of
which stands for a word that appears in that document. In the Inverted Barrels, after the Sorter
has finished sorting, everything is inverted: each word ID now leads to the document IDs that
contain it, the opposite of the way it was in the Forward Barrels. We can now understand why the
Lexicon is so small: it holds only the words and their word IDs, and each word ID can then be
matched with the right document IDs through the Inverted Barrels. This is exactly what happens
when one searches for a word in a query.
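
To see how a hit can fit in 2 bytes, the sketch below packs a plain hit with bit operations. The
field widths (1 capitalization bit, 3 font-size bits, 12 position bits) follow the plain-hit
layout in the original paper, which this report does not spell out. The function names, and the
choice to clamp long positions to 4095, are assumptions made for the example.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack one word occurrence into a 16-bit integer (2 bytes)."""
    if not 0 <= font_size <= 6:
        # In the paper the value 7 is reserved to flag fancy hits.
        raise ValueError("font_size must be between 0 and 6")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # assumption: clamp to the largest 12-bit value
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_plain_hit(capitalized=True, font_size=3, position=42)
raw = hit.to_bytes(2, "big")  # the same hit as exactly two bytes
```

Unpacking the value returns the three original fields, so nothing is lost as long as the word
sits within the first 4096 positions of the page.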
Google brings us something that never existed before: an amazing way to access information and
connect everyone in the world. In the future, its developers plan on adding Boolean operators,
negation, and stemming, and hopefully a more personalized version of PageRank.