

Discussion Report

The Anatomy of a Large-Scale Hypertextual Web Search Engine

By Jason Wang

Introduction

Every day, millions more people start using the web, and thousands of new web pages sprout up, which makes finding information harder and harder. The developers at Google have designed a way to sort through all of that information.

Design Goals

There are a few things that everyone wants from a search engine: the speed of the results, the quantity of the results, and the relevance of the results. After all, we do not want a million results that have nothing to do with what we are searching for.

Google System Features

Google, a second-generation search engine, brings a couple of radical new ideas to the table.

It introduces the ideas of Page Rank and Anchor Text. Page Rank is the idea that every page on the web votes for other pages by linking to them. One can imagine a random surfer clicking from link to link across the internet, and a page's Page Rank is the probability that the surfer is on that page at any given time. As a result, a page with many links pointing to it will tend to have a high Page Rank.

With this in mind, spammers attempt to create an artificially high Page Rank by creating a bunch of spam pages that link to their desired web page. Google's Page Rank addresses this problem by weighting each vote: a link from a page with a high Page Rank is worth more than a link from a page with a low Page Rank. The formula is PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where T1 through Tn are the pages that link to page A, C(T) is the number of links going out of page T, and d is a damping factor between 0 and 1, which the paper usually sets to 0.85.
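The formula can be tried out on a tiny made-up web. The sketch below is only an illustration, not Google's implementation: it repeatedly applies PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) until the values settle. The names page_rank and toy_web, the starting value of 1.0 for every page, and the fixed number of iterations are all assumptions made for this example.

```python
def page_rank(links, d=0.85, iterations=50):
    """Repeatedly apply PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to (assumed to
    contain no duplicates), so len(links[T]) plays the role of C(T).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    ranks = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Each page T that links here passes on PR(T) split over its C(T) links.
            incoming = sum(
                ranks[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical three-page web: "a" and "b" both link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = page_rank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy web, page "c" ends up with the highest rank because two pages vote for it, "a" comes second because its only vote comes from the highly ranked "c", and "b", which nobody links to, is left with the minimum value of 1-d.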

Another system feature that Google can boast is its intelligent use of Anchor Text. Anchor text is the text of a link, the words that a reader clicks on. This text is written by the author of the page that contains the link, and Google associates it with the page the link points to. The advantage is that anchor text often describes a page more accurately than the page describes itself, which improves search accuracy. It also lets Google return documents that a text-based search engine cannot index, such as images, programs, and databases. Another advantage is that pages which have not actually been crawled are still able to come up as search results.
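To make the idea concrete, here is a minimal sketch of indexing by anchor text. It is an illustration under assumptions, not the paper's data structure: the function name build_anchor_index, the tuple layout, and the example URLs are all invented. The point is that each word of a link's text is filed under the page the link points to, so an image that contains no text of its own can still be found.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Map each word of a link's anchor text to the URLs that the links point to."""
    index = defaultdict(set)
    for _source_url, target_url, anchor_text in links:
        for word in anchor_text.lower().split():
            index[word].add(target_url)
    return index


# Hypothetical links: (page containing the link, page linked to, anchor text).
links = [
    ("example.org/blog", "example.org/cat.jpg", "Sleeping cat photo"),
    ("example.net/pets", "example.org/cat.jpg", "cat picture"),
    ("example.net/pets", "example.com/dogs", "dog training guide"),
]
anchor_index = build_anchor_index(links)
print(sorted(anchor_index["cat"]))
```

Searching for "cat" returns the image even though the image itself was never parsed for words.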

Spammers try to take advantage of anchor text by writing link text that has no relevance to the picture or page it points to. Google counters this by checking that the anchor text also appears somewhere on the target page, or that it matches the anchor text of other links pointing to that page.

Apart from Page Rank and Anchor Text, Google makes use of other small signals such as the proximity of words, font size, color, and where the text is on the page. For example, when two words that match the search are close to each other on the page, that page will most likely be more relevant than one where they are far apart. Text in a larger font counts as more relevant than text in a smaller font, and text that is colored or comes early in the page is most likely more important than plain black text or text that comes later. Spammers have attempted to take advantage of this by making everything a super large font, but Google has battled back by measuring font size relative to the rest of the text on the same page.
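The proximity idea can be illustrated with a short sketch. The scoring rule used here (one divided by the smallest distance between the two words) is an assumption chosen for simplicity and is not Google's actual formula. The function names are also invented for this example.

```python
def min_distance(words, term_a, term_b):
    """Smallest gap, in word positions, between any occurrences of the two terms."""
    positions_a = [i for i, w in enumerate(words) if w == term_a]
    positions_b = [i for i, w in enumerate(words) if w == term_b]
    if not positions_a or not positions_b:
        return None  # at least one term is missing from the text
    return min(abs(a - b) for a in positions_a for b in positions_b)


def proximity_score(text, term_a, term_b):
    """Hypothetical score: 1.0 when the terms are adjacent, smaller as they move apart."""
    distance = min_distance(text.lower().split(), term_a, term_b)
    if distance is None:
        return 0.0
    return 1.0 / max(distance, 1)  # max() guards against identical terms


close = proximity_score("search engine design notes", "search", "engine")
far = proximity_score("search the web with a fast engine", "search", "engine")
print(close, far)
```

The page where "search" and "engine" sit side by side scores higher than the page where five other words separate them.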

The System Architecture

Have you ever wondered what goes on behind the scenes when you search? A user types a query into the white empty box and ends up with the desired results in a matter of milliseconds. What happens behind the box all begins with the URL Server. You can go to Google yourself and request that your own URL be crawled, and it will simply go to the URL Server. This server tells the Crawlers where to crawl. Google normally runs about 3 Crawlers at one time. The crawling system can fetch roughly 100 pages per second at peak, and each Crawler keeps roughly 300 connections open at once. The pages that the Crawlers fetch are given to the Store Server, where all of the data is compressed. The compressed data is then given to the Repository, where it is stored.
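The flow from URL Server to Crawler to Store Server to Repository can be sketched in a few lines. This is a toy model under several assumptions: the function names are invented, fake_fetch stands in for a real network request, and zlib is used as the compressor for illustration.

```python
import zlib
from collections import deque


def fake_fetch(url):
    """Hypothetical stand-in for a Crawler downloading a page."""
    return "<html><body>Page at " + url + ". " + "hello web " * 50 + "</body></html>"


def run_pipeline(urls, fetch):
    url_server = deque(urls)  # queue of URLs waiting to be crawled
    repository = {}           # URL -> compressed page
    while url_server:
        url = url_server.popleft()                             # URL Server hands out a URL
        page = fetch(url)                                      # Crawler fetches the page
        repository[url] = zlib.compress(page.encode("utf-8"))  # Store Server compresses it
    return repository


repository = run_pipeline(["example.org/a", "example.org/b"], fake_fetch)
for url, blob in repository.items():
    print(url, len(fake_fetch(url)), "->", len(blob), "bytes")
```

Decompressing an entry with zlib.decompress gives back the original page, which is what the Indexer would later read from the Repository.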

The Indexer takes the files from the Repository and sorts them into the Barrels. There are two kinds of Barrels. The Indexer creates the forward barrels, and later the Sorter takes the information in the forward index and uses it to create the inverted index. This way, when the user searches, the engine is able to go from a word to the websites that contain it rather than from a website to its words. The Lexicon sits right next to the barrels. It can be viewed as the place where only the words themselves are stored: a list of every distinct word found in the crawled pages. While indexing, the Indexer also extracts all of the links and anchor text and sends them on to the components where Page Rank is computed. The Document Index holds a brief summary of every page, which is also given to the searcher.
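The difference between the forward and inverted index is easiest to see on a small example. The sketch below is a simplification under assumptions: it uses plain words and small integers where the real system uses word IDs and document IDs spread over many barrels, and the function names are invented.

```python
from collections import defaultdict


def build_forward_index(documents):
    """Forward index: document ID -> list of the words in that document."""
    return {doc_id: text.lower().split() for doc_id, text in documents.items()}


def invert(forward_index):
    """Inverted index: word -> sorted list of document IDs that contain it."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in inverted.items()}


# Hypothetical documents keyed by document ID.
documents = {1: "web search engine", 2: "web crawler", 3: "search index"}
forward = build_forward_index(documents)
inverted = invert(forward)
print(forward[2])
print(inverted["search"])
```

Answering a query for "search" is a single lookup in the inverted index, whereas the forward index would have to be scanned document by document.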

The Storage

The total size of the Forward Barrels is approximately 43GB, while the Inverted Barrels are about 41GB. The Lexicon is rather small in comparison at 293MB. Within all of these spaces is a compact scheme Google developed to store data. Every occurrence of a word on a web page is called a “hit,” and Google compresses each hit down to 2 bytes. These 2 bytes record whether the hit is plain, fancy, or anchor. In the paper, fancy hits are words that occur in a URL, title, anchor text, or meta tag, and plain hits are everything else. The 2 bytes also record capitalization, relative font size, and where the word is located on the page, with each position encoded as a numerical value.

Within the Forward Barrels, every page has a number called the document ID. Under each document ID there is a list of word IDs, each representing a word that appears in that document. In the Inverted Barrels, after the Sorter has finished, everything is inverted: each word ID now leads to the document IDs that contain it, the opposite of the way it was in the Forward Barrels.

We can now understand why the Lexicon is so small: it only holds the words and their word IDs, and each word ID can be matched up with the correct document IDs in the Inverted Barrels. This is exactly what happens when one searches for a word in a query.
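Fitting a hit into 2 bytes comes down to bit packing. The sketch below follows the layout the paper gives for a plain hit (1 bit for capitalization, 3 bits for font size, 12 bits for position), but the order of the fields within the 16 bits, the function names, and the choice to clamp large positions to 4095 are assumptions made for this example.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack one plain hit into 16 bits: 1 bit cap, 3 bits font size, 12 bits position."""
    if not 0 <= font_size < 7:
        # In the paper, font size 7 (binary 111) is reserved to flag a fancy hit.
        raise ValueError("font_size must be 0..6 for a plain hit")
    if position < 0:
        raise ValueError("position must not be negative")
    position = min(position, 4095)  # assumed: clamp to the largest 12-bit value
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Recover (capitalized, font_size, position) from a packed 16-bit hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_plain_hit(capitalized=True, font_size=3, position=42)
print(hit.to_bytes(2, "big"), unpack_plain_hit(hit))
```

Because every field has a fixed width, a whole hit list can be stored as a flat run of 2-byte values with no separators, which is part of why the barrels stay compact.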

Conclusion

Google offers something that never existed before: an amazing way to access information that connects everyone in the world. In the future, its developers plan on adding Boolean operators, negation, and stemming, and hopefully a more personalized version of Page Rank.
