The Web Graph and PageRank Networked Life Prof. Michael Kearns Lecture Roadmap • • • • Viewing the Web as a directed graph or network Structure of the Web Measures of vertex importance in a network Google’s PageRank algorithm Directed Networks • The web graph is a directed network: – – – – page A can point/link to page B, without a reciprocal link represent directed edges with arrows each web page/site thus has “in-links” and “out-links” in-degree = # of in-links; out-degree = # of out-links B D A C E F The Web as Network • What is the structure of this network? – connected components – degree distributions – etc. • What does it say about the people building and using it? – page and link generation – visitation statistics • What are the algorithmic consequences? – web search – community identification Graph Structure in the Web [Broder et al. paper] • Report on the results of two massive “web crawls” • Executed by AltaVista in May and October 1999 • Details of the crawls: – – – – automated script following hyperlinks (URLs) from pages found large set of starting points collected over time crawl implemented as breadth-first search have to deal with webspam, infinite paths, timeouts, duplicates, etc. • May ’99 crawl: – 200 million pages, 1.5 billion links • Oct ’99 crawl: – 271 million pages, 2.1 billion links • Unaudited, self-reported Sep ’03 stats: – 3 major search engines claim > 3 billion pages indexed Five Easy Pieces • Authors did two kinds of breadth-first search: – ignoring link direction weak connectivity – only following forward links strong connectivity • They then identify five different regions of the web: – strongly connected component (SCC): • can reach any page in SCC from any other in directed fashion – component IN: • can reach any page in SCC in directed fashion, but not reverse – component OUT: • can be reached from any page in SCC, but not reverse – component TENDRILS: • weakly connected to all of the above, but cannot reach SCC or be reached from SCC in directed fashion (e.g. pointed to by IN) – SCC+IN+OUT+TENDRILS form weakly connected component (WCC) – everything else is called DISC (disconnected from the above) – here is a visualization of this structure Size of the Five • • • • • • • SCC: ~56M pages, ~28% IN: ~43M pages, ~ 21% OUT: ~43M pages, ~21% TENDRILS: ~44M pages, ~22% DISC: ~17M pages, ~8% WCC > 91% of the web --- the giant component One interpretation of the pieces: – SCC: the heart of the web – IN: newer sites not yet discovered and linked to – OUT: “insular” pages like corporate web sites Diameter Measurements • Directed worst-case diameter of the SCC: – at least 28 • Directed worst-case diameter of IN SCC OUT: – at least 503 • Over 75% of the time, there is no directed path between a random start and finish page in the WCC – when there is a directed path, average length is 16 • Average undirected distance in the WCC is 7 • Moral: – web is a “small world” when we ignore direction – otherwise the picture is more complex Degree Distributions • They are, of course, heavy-tailed • Power law distribution of component size – consistent with the Erdos-Renyi model • Undirected connectivity of web not reliant on “connectors” – what happens as we remove high-degree vertices? Important Vertices • So far: have emphasized macroscopic aspects of network structure – – – – degree distribution: all degrees diameter: average over all vertex pairs connectivity: giant component with most vertices clustering: average over all vertices, compare to overall edge density – – – – high degree link between communities betweeness centrality Google’s PageRank algorithm • Also interesting to identify “important” individuals within the network • Some purely structural definitions of importance: Background on Web Search • Most important sources of information: words in query and on web pages • For instance, on query “mountain biking”: – realize “bike” and “bicycle” are synonymous under context “mountain” – but “bike” and “motorcylce” are not – find documents with these words and their correlates (“Trek”, “trails”) • Subject of the field of information retreival • But many other “features” or “signals” may be useful for identifying good or useful sites: – font sizes – frequency of exclamation marks • PageRank idea: use the link structure of the web Directed Networks • Q: What constitutes an “important” vertex in a directed network? – could just use in-degree; view directed links as referrals • PageRank answer: an important page is pointed to by lots of other important pages • This, of course, is circular… B D A C E F The PageRank Algorithm • • • • • Suppose p and q are pages where q p Let R(q) be rank of q; let out(q) be out-degree of q Idea: q “distributes” its rank over its out-links Each out-link of q receives R(q)/out(q) So then p’s rank should be: R( p) R(q) /out(q) qPOINTS( p ) B D A C E F The PageRank Algorithm • What guarantee do we have that all these equations are consistent? • Idea: turn the equations into an update rule (algorithm) • At each time step, pick some vertex p to update, and perform: R( p) R(q) /out(q) qPOINTS( p ) • Right-hand side R(q) are current/frozen values; R(p) is new rank of p • Initialize all R(p) to 1 • Claim: Under broad conditions, this algorithm will converge: – at some point, updates no longer change any values – have found solution to all the R(p) equations 0.31 0.15 B C A 0.38 E D 0.38 F 0.64 0.54 0.15 G 0.21 0.40 B C A 0.44 E D 0.44 F 0.70 0.67 Summary • PageRank defines importance in a circular or self-referential fashion • Circularity is broken with a simple algorithm that provably converges • PageRank defines R(p) globally; more subtle than local properties of p