Google View January 25, 2007

The Anatomy of a Large-Scale Hypertextual Web Search Engine
1) Section 4.1 and Figure 1
a) URLserver- sends lists of URLs to be fetched by the crawlers (URL- address of a
web page)
b) Crawler- using a number of distributed crawlers, web pages are downloaded
c) The pages are compressed as they are put into the Repository
d) The Repository is the necessary step between the Crawler and the Indexer; it holds
the stored copy of every downloaded page
i) Cached copy- shows the page as the search engine saw it when it was crawled
e) The stored pages are then passed to the Indexer
i) The Indexer maps words to page IDs (DocIDs)
ii) Storing short page IDs instead of full URLs keeps the size of the Index small
iii) It reads the repository, uncompresses the documents, and parses them
(1) Parsing- going through a page and extracting terms
f) While the Indexer parses each page, it also extracts the page's links and writes
them to an anchors file
g) Anchors file- records where each link points from and to, together with the
anchor text of the link
i) Every occurrence of a term produces a hit; the hits for a term in a page form a
hit list
(1) Hit- stores the term, the page, the position in the page, and font information
(a) Used to compute proximity and importance of words
(b) Fancy Hit- a hit in the URL, title, meta tag, or anchor text, including anchor
text on any page linking to that page
h) Barrels- the Indexer distributes the hits into barrels, which hold a forward index;
the Sorter then re-sorts them by term into an inverted index, which the Searcher uses
i) Forward index- lists each DocID with the terms that appear in that page
ii) Inverted index- lists each term with the DocIDs of the pages containing it
i) PageRank- a measure of each page's importance computed from the link structure;
combined with how well a page matches the query to order the results
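
The indexing steps above (parse a page into terms, record a hit for each occurrence, build a forward index, then re-sort it into an inverted index) can be illustrated with a minimal Python sketch. This is only a toy model under stated assumptions: the function names `parse`, `build_forward_index`, and `invert` are hypothetical, pages are plain strings split on whitespace, and a hit stores only a word position. The real system also records font information and packs hits into barrels.

```python
from collections import defaultdict


def parse(text):
    """Split a page into lowercase terms (a stand-in for real HTML parsing)."""
    return text.lower().split()


def build_forward_index(pages):
    """Map each DocID to {term: [positions]}; each position is one plain hit."""
    forward = {}
    for doc_id, text in pages.items():
        hit_lists = defaultdict(list)
        for position, term in enumerate(parse(text)):
            hit_lists[term].append(position)
        forward[doc_id] = dict(hit_lists)
    return forward


def invert(forward):
    """Re-sort the forward index by term: term -> {DocID: [positions]}."""
    inverted = defaultdict(dict)
    for doc_id, hit_lists in forward.items():
        for term, positions in hit_lists.items():
            inverted[term][doc_id] = positions
    return dict(inverted)


# Two made-up pages keyed by DocID (assumed data, for illustration only).
pages = {1: "web search engine", 2: "large scale web crawler"}
forward = build_forward_index(pages)
inverted = invert(forward)
print(inverted["web"])  # {1: [0], 2: [2]}
```

Looking up a term in `inverted` returns every DocID that contains it along with the positions, which is the information a searcher would need to compute proximity between query words.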
Hypersearching the Web
1) Clever Project
a) Members- IBM Almaden, Cornell, Berkeley
b) Deals with the chaotic growth of the web
i) Many pages added per day
ii) No control over the contents of web pages
iii) No control over the web’s link graph
c) Extensive use of hyperlinks
d) Hubs and Authorities
2) Clever takes the first 200 pages returned by another search engine (AltaVista)
a) Does not trust that engine's rankings, so it re-ranks the pages itself
b) Authority- A page that is highly relevant to particular terms. Pointed to by many
hubs.
c) Hub- A page with links to many authorities.
d) Page Categories
i) Useless
ii) Hubs
iii) Authorities
iv) Hub and Authority (e.g., Wikipedia)
e) To find hubs and authorities, Clever uses the link structure
i) Clever takes the first 200 pages
ii) Follows all of their links and fetches those pages too.
iii) Pages that are linked to more often, especially by good hubs, are ranked higher
as authorities.
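
The hub and authority idea above can be sketched as a short iterative computation. The following Python example is a minimal sketch of the basic hubs-and-authorities iteration, not necessarily what the Clever system runs: the function name `hubs_and_authorities`, the parameter `iterations`, and the toy link graph are all assumptions made for illustration. Each round, a page's authority score becomes the sum of the hub scores of the pages linking to it, and its hub score becomes the sum of the authority scores of the pages it links to.

```python
def hubs_and_authorities(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of authority scores of the pages it points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Made-up link graph: three hub-like pages pointing at two authority-like pages.
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
    "a1": [],
    "a2": [],
}
hub, auth = hubs_and_authorities(links)
best_authority = max(auth, key=auth.get)
print(best_authority)  # a1
```

In this toy graph, `a1` is linked to by all three hubs and ends up with the highest authority score, while `h1` and `h2` score higher as hubs than `h3` because they point to both authorities.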