The Anatomy of a Large-Scale Hypertextual Web Search Engine

What we want from a search
•Quantity of Results
•Efficient Storage Space
•Quality of Results
Google attempts to deliver
all three of these
in its search results.
Precision of result:
Second Generation Search Engine
PageRank
Anchor Text
The more links that point to a page (from other
pages), the higher its PageRank will be.
PageRank can be read as the probability that a random internet
surfer will reach this page by randomly clicking links.
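It is also determined by the PageRank of each linking page and by
the number of outgoing links that page has.
The higher the rank of page A, the more its link to B is worth;
the more outgoing links page A has, the less each of them contributes,
since a page's rank is split evenly across its outgoing links.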
PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
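The formula can be evaluated by repeated substitution until the values settle. The following is a minimal sketch, not the production algorithm: the function name `pagerank`, the damping value d = 0.85, the fixed iteration count, and the toy link graph are all assumptions chosen for illustration. Here T1..Tn are the pages linking to A, and C(T) is the number of outgoing links on page T.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical input shape).
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of distinct outgoing links on each page.
    out_count = {p: len(set(links.get(p, []))) for p in pages}

    # For each page, the pages T that link to it.
    inbound = {p: [] for p in pages}
    for source, targets in links.items():
        for target in set(targets):
            inbound[target].append(source)

    # Start every page at rank 1 and apply the formula repeatedly.
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t] for t in inbound[p])
            for p in pages
        }
    return pr


if __name__ == "__main__":
    toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, rank in sorted(pagerank(toy_web).items()):
        print(page, round(rank, 3))
```

In the toy graph, C receives links from both A and B and ends up with the highest rank, while B, which only receives half of A's rank, ends up lowest.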
Anchor Text
Every link on the web carries anchor text: the text the
page creator attaches to the link to describe what it does,
where it leads, or what it attempts to explain.
By collecting the anchor text of links from hundreds of
different sites and associating it with the page each link
points to, Google can provide more relevant search results.
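The idea of crediting anchor text to the page a link points to can be sketched as a small index. This is an illustrative assumption, not Google's data structure: the function names, the `(source, target, anchor_text)` tuple shape, and the ranking by number of distinct linking sites are all hypothetical.

```python
from collections import defaultdict


def index_anchor_text(links):
    """Build word -> target page -> set of source pages that used the word.

    links is an iterable of (source_url, target_url, anchor_text) tuples.
    """
    anchor_index = defaultdict(lambda: defaultdict(set))
    for source, target, text in links:
        for word in text.lower().split():
            # The word is credited to the page the link points TO.
            anchor_index[word][target].add(source)
    return anchor_index


def search_anchors(anchor_index, word):
    """Return target pages, most distinct linking sources first."""
    hits = anchor_index.get(word.lower(), {})
    return sorted(hits, key=lambda target: (-len(hits[target]), target))


if __name__ == "__main__":
    links = [
        ("a.example", "c.example", "search engine"),
        ("b.example", "c.example", "best search"),
        ("a.example", "d.example", "search tips"),
    ]
    index = index_anchor_text(links)
    print(search_anchors(index, "search"))
```

A page can be found this way even if the word never appears in its own text, because the description comes from the pages that link to it.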
Proximity Search and Others
Google keeps track of how close the
related words are to each other and
also keeps track of the visual
presentation (font size, color, boldness).
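One way to picture the proximity part is to record the position of every word in a document and measure the smallest gap between two query words. The sketch below assumes whitespace tokenization and hypothetical helper names; it does not model the visual presentation signals, and it is not a description of how Google scores proximity.

```python
def word_positions(text):
    """Map each lowercased word to the sorted list of positions where it occurs."""
    positions = {}
    for pos, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(pos)
    return positions


def min_distance(positions_a, positions_b):
    """Smallest gap between any occurrence of two words.

    Both inputs must be sorted lists of word positions.
    Returns None if either word does not occur.
    """
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance whichever list currently points at the earlier position.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


if __name__ == "__main__":
    doc = "web search engines rank the web by link structure and search quality"
    pos = word_positions(doc)
    print(min_distance(pos["web"], pos["search"]))
```

A smaller distance suggests the words are used together, so a document where the query words sit side by side could be scored above one where they are far apart.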
Crawling and Indexing
•Google typically ran about 3 crawlers.
•Each crawler opens roughly 300
connections at once.
•At peak performance, with 4 crawlers,
Google can crawl 100 web pages per
second.
•This is roughly 600K of data per second.
•Indexing documents into barrels
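The crawl figures above can be combined into a quick back-of-the-envelope check. This sketch takes the rate as 100 pages per second and reads "600K" as kilobytes per second; both readings are assumptions, as is the helper name `crawl_estimates`.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_estimates(pages_per_second=100, kb_per_second=600,
                    crawlers=4, connections_per_crawler=300):
    """Derive rough totals from the peak crawl figures quoted in the slides."""
    return {
        # Average amount of data fetched per page.
        "avg_page_kb": kb_per_second / pages_per_second,
        # Connections held open across all crawlers at once.
        "open_connections": crawlers * connections_per_crawler,
        # Pages fetched in a day if the peak rate were sustained.
        "pages_per_day": pages_per_second * SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in crawl_estimates().items():
        print(name, value)
```

Under these assumptions the average page is about 6 KB, roughly 1,200 connections are open at once, and a sustained peak would cover about 8.6 million pages per day.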