Google Search Engine*
CS461 Lecture
Department of Computer Science, Iowa State University

* 1. "The Anatomy of a Large-Scale Hypertextual Web Search Engine", S. Brin and L. Page, in Proceedings of WWW'98
  2. "The PageRank Citation Ranking: Bringing Order to the Web", L. Page, S. Brin, R. Motwani, and T. Winograd, Technical Report, Stanford University, 1998

What to cover today
- PageRank
- Google Architecture

Problem Statement
- Ultimate version
  - Find what I want
  - In most cases, I don't know exactly, or cannot express clearly, what I want
  - "What-I-want" can be estimated using a set of keywords
- Simplified version
  - Find the files that are most related to a set of keywords

Naïve Solution
- How it works
  - Download the entire Internet to a local machine
  - Search and return all files containing the set of keywords
- Problems
  - All files are treated as equally important
  - Could return tons of files, most of which are not what I want
  - Since most users check only the first few files, this scheme rarely finds anything useful

Ranking Based on Hit Rate
- How it works
  - A file is ranked higher if it is visited more frequently
- Problems
  - Could be affected by faked hits
  - A highly ranked file gets visited more, so it is ranked higher and higher

Ranking Based on Citation
- Basic idea
  - A paper is important if it is cited by many papers
    - Each paper has a set of references that link to related work
    - A pioneering paper typically has a high citation count
  - An HTML page is more important if it is linked to by many other pages
    - Each page may link to other pages
- Problems
  - Publication of academic papers is well controlled
    - Many are peer-reviewed
    - Chronologically ordered
  - Internet files could be anything

Proposed: PageRank
- Basic idea
  - A page with many links to it is more likely to be useful than one with few links to it
    - Just like citation
  - The links from a page that itself is the target of many links are likely to be particularly important
    - This is something new
    - Each link has a different weight
- [Figure: a page with its back links (incoming) and forward links (outgoing)]

Proposed: PageRank
- How it works
  - Each page is ranked using a value called PageRank (PR)
  - A page's PR depends on the PRs of its back-link pages

      PR(A) = (1 - d) + d * [PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)]

  - d: damping factor, normally set to 0.85
  - T1, ..., Tn: pages that point to page A
  - PR(A): PageRank of page A
  - PR(Ti): PageRank of page Ti, which points to page A
  - C(Ti): the number of links going out of page Ti

Proposed: PageRank
- Properties of the PageRank formula
  - PageRanks form a probability distribution over web pages, so the normalized sum of all web pages' PageRanks (the sum divided by the number of pages, i.e., the average PageRank) will be one
- Challenge of calculating PageRanks
  - The links could be circular, e.g., A -> B -> A
  - [Figure: Page A and Page B linking to each other]

PageRank Calculation
- Assign each page an initial rank value
  - Could be any number (seed)
- Repeat calculations until the rank of each page does not change much
- Two-page example (Page A and Page B link to each other), d = 0.85:

      PR(A) = (1 - d) + d * (PR(B)/1)
      PR(B) = (1 - d) + d * (PR(A)/1)

Seed = 1
  PR(A) = 0.15 + 0.85 * 1 = 1
  PR(B) = 0.15 + 0.85 * 1 = 1

Seed = 0
  1) PR(A) = 0.15 + 0.85 * 0 = 0.15
     PR(B) = 0.15 + 0.85 * 0.15 = 0.2775
  2) PR(A) = 0.15 + 0.85 * 0.2775 = 0.385875
     PR(B) = 0.15 + 0.85 * 0.385875 = 0.47799375
  3) PR(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875
     PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.622850484375

Seed = 40
  1) PR(A) = 0.15 + 0.85 * 40 = 34.25
     PR(B) = 0.15 + 0.85 * 34.25 = 29.2625
  2) PR(A) = 0.15 + 0.85 * 29.2625 = 25.023125
     PR(B) = 0.15 + 0.85 * 25.023125 = 21.41965625
  3) ......
Observation: It doesn't matter what seed value you use: once the PageRank calculations settle down, the "normalized probability distribution" (the average PageRank over all pages) will be 1.0

Example of Calculation (0)
[Figure: link graph of four pages: Page A, Page B, Page C, Page D]

Example of Calculation (1)
Every page starts with a PageRank of 1:
  Page A 1
  Page B 1
  Page C 1
  Page D 1

Example of Calculation (2)
[Figure: each page passes on 0.85 of its PageRank, split evenly over its outgoing links]
- Page A (1) passes 1*0.85/2 to Page B and 1*0.85/2 to Page C
- Page B (1) passes 1*0.85 to Page C
- Page C (1) passes 1*0.85 to Page A
- Page D (1) passes 1*0.85 to Page C
Each page keeps the 0.15 it has not passed on, so we get:
- Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
- Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
- Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
- Page D: receives none, but has not transferred 0.15 = 0.15
Result:
  Page A 1
  Page B 0.575
  Page C 2.275
  Page D 0.15

Example of Calculation (3)
Starting from:
  Page A 1
  Page B 0.575
  Page C 2.275
  Page D 0.15
- Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
- Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
- Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
- Page D: receives none, but has not transferred 0.15, so it remains at 0.15
Result:
  Page A 2.08375
  Page B 0.575
  Page C 1.19125
  Page D 0.15

Example of Calculation (4)
After 20 iterations, we get:
  Page A 1.490
  Page B 0.783
  Page C 1.577
  Page D 0.15
In reality: PageRank for 26,000,000 web pages can be computed in a few hours on a medium-size workstation (1998).

Result
Page C has the highest PageRank, and Page A has the next highest: Page C is the most important page in this link graph.
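The four-page example can be reproduced with a short Python sketch. The link graph used here (A -> B, A -> C, B -> C, C -> A, D -> C) is inferred from the transfers listed in the example, and the names `pagerank`, `links`, `iterations`, and `seed` are assumptions introduced for illustration, not part of the lecture. Each round uses only the previous round's values, which matches the arithmetic in Example of Calculation (2) and (3).

```python
def pagerank(links, d=0.85, iterations=20, seed=1.0):
    """Iterate PR(A) = (1-d) + d * sum(PR(Ti)/C(Ti)) over a small link graph.

    `links` maps each page to the list of pages it links to.
    Every round is computed from the previous round's ranks only.
    """
    ranks = {page: float(seed) for page in links}
    for _ in range(iterations):
        # Each page keeps the (1 - d) share that is "not transferred".
        new_ranks = {page: 1 - d for page in links}
        for source, targets in links.items():
            if not targets:
                continue  # a page without forward links passes nothing on
            share = d * ranks[source] / len(targets)  # d * PR(Ti) / C(Ti)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Link graph inferred from the example: who passes rank to whom.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

for rounds in (1, 2, 20):
    result = pagerank(links, iterations=rounds)
    summary = ", ".join(f"{page}={rank:.3f}" for page, rank in result.items())
    print(f"after {rounds} round(s): {summary}")
```

With a seed of 1 the four ranks sum to 4 after every round (up to floating-point rounding), so the average PageRank stays at 1.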
More iterations bring the PageRank values closer to stable values, which are then used to rank the pages returned for a keyword search.

PageRank Summary
- PageRank is a citation importance ranking
  - Approximated measure of importance or quality
  - Based on the number of citations or backlinks
  - Each citation has a different weight
- The pages with high PageRanks are those that are linked to by many pages and/or by important pages (e.g., Yahoo!)
- Questions: how can you improve the ranking of your web pages?
  - Creating dummy sites that link to the main site?
  - Increasing internal links and/or decreasing external links?

Google Architecture