Google View
January 25, 2007

The Anatomy of a Large-Scale Hypertextual Web Search Engine
1) Section 4.1 and Figure 1
   a) URLserver- sends lists of URLs to be fetched by the crawlers (URL- address of a web page)
   b) Crawler- web pages are downloaded by a number of distributed crawlers
   c) The pages are compressed as they are put into the Repository
   d) The Repository is the necessary step in between the Crawler and the Indexer
      i) Cached- how the search engine saw the page
   e) This information is then sent to the Indexer
      i) The Indexer maps words to page IDs
      ii) Page IDs keep the size of the index small
      iii) It reads the Repository, uncompresses the documents, and parses them
         (1) Parsing- going through a page and extracting terms
   f) While pages are being put through the Indexer, the Anchors file is also being built
   g) Anchors- stores the links found on each page: where each link points from and to, and its anchor text
      i) For every term there is a hit made and a hit list is created
         (1) Hits- store term, page, position, and font information
            (a) Used to compute proximity and importance of words
            (b) Fancy Hit- a hit in the title, heading, or anchor text in any page linking to that page
   h) Barrels- hold the forward index; these are then sorted into an inverted index, which is used by the Searcher
      i) Forward index- lists DocIDs and the terms each document contains
      ii) Inverted index- lists terms and the DocIDs containing the term
   i) PageRank- a link-based measure of page importance, used with the hit information to rank the pages returned for a query

Hypersearching the Web
1) Clever Project
   a) Members: IBM Almaden, Cornell, Berkeley
   b) Deals with the chaotic growth of the web
      i) Many pages added per day
      ii) No control over contents of web pages
      iii) No control over the web’s link graph
   c) Extensive use of hyperlinks
   d) Hubs and Authorities
2) Clever takes the first 200 pages from another search engine (AltaVista)
   a) Does not rely on that engine's rankings
   b) Authority- The relevance of the page to particular terms. Pointed to by many hubs.
   c) Hub- Points to many authorities. A page with many links to authorities
   d) Page Categories
      i) Useless
      ii) Hubs
      iii) Authorities
      iv) Hub and Authority (e.g., Wikipedia)
   e) To find hubs and authorities they use the link structure
      i) Clever takes the first 200 pages
      ii) It looks at all of their links and gets those pages too
      iii) The pages that are linked to more often are considered to be higher ranking
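
The hub and authority idea above can be sketched in a few lines of Python. This is only an illustration, not Clever's actual code: the toy link graph, the page names, and the function name `hubs_and_authorities` are all made up for the example. It also assumes the usual iterative form of the idea, where a page's authority score is the sum of the hub scores of the pages linking to it and a page's hub score is the sum of the authority scores of the pages it links to. The notes themselves only say that pages linked to more often rank higher.

```python
from math import sqrt


def hubs_and_authorities(links, iterations=20):
    """Score pages as hubs and authorities.

    links maps each page to the list of pages it links to.
    The graph passed in below is a hypothetical toy example.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]

        # Hub score: sum of authority scores of the pages it points to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}

        # Normalize so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm

    return hub, auth


# Hypothetical pages: three hubs pointing at two authorities.
toy_links = {
    "hub1": ["authA", "authB"],
    "hub2": ["authA", "authB"],
    "hub3": ["authA"],
    "authA": [],
    "authB": [],
}

hub, auth = hubs_and_authorities(toy_links)
top_authority = max(auth, key=auth.get)
print("Top authority:", top_authority)
```

In this toy graph `authA` is linked to by all three hubs and `authB` by only two, so `authA` ends up with the higher authority score. Likewise `hub1` and `hub2` score higher as hubs than `hub3`, because they point to more authorities.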