The Web as Network
Networked Life CSE 112 Spring 2006
Prof. Michael Kearns

The Web as Network
• Consider the web as a network
  – vertices: individual (html) pages
  – edges: hyperlinks between pages
  – will view as both a directed and undirected graph
• What is the structure of this network?
  – connected components
  – degree distributions
  – etc.
• What does it say about the people building and using it?
  – page and link generation
  – visitation statistics
• What are the algorithmic consequences?
  – web search
  – community identification

Graph Structure in the Web [Broder et al. paper]
• Report on the results of two massive “web crawls”
• Executed by AltaVista in May and October 1999
• Details of the crawls:
  – automated script following hyperlinks (URLs) from pages found
  – large set of starting points collected over time
  – crawl implemented as breadth-first search
  – have to deal with webspam, infinite paths, timeouts, duplicates, etc.
• May ’99 crawl:
  – 200 million pages, 1.5 billion links
• Oct ’99 crawl:
  – 271 million pages, 2.1 billion links
• Unaudited, self-reported Sep ’03 stats:
  – 3 major search engines claim > 3 billion pages indexed

Five Easy Pieces
• Authors did two kinds of breadth-first search:
  – ignoring link direction → weak connectivity
  – only following forward links → strong connectivity
• They then identify five different regions of the web:
  – strongly connected component (SCC):
    • can reach any page in SCC from any other in directed fashion
  – component IN:
    • can reach any page in SCC in directed fashion, but not the reverse
  – component OUT:
    • can be reached from any page in SCC, but not the reverse
  – component TENDRILS:
    • weakly connected to all of the above, but cannot reach SCC or be reached from SCC in directed fashion (e.g. pointed to by IN)
  – SCC+IN+OUT+TENDRILS form the weakly connected component (WCC)
  – everything else is called DISC (disconnected from the above)
  – here is a visualization of this structure (a code sketch of the decomposition follows the Degree Distributions slide below)

Size of the Five
• SCC: ~56M pages, ~28%
• IN: ~43M pages, ~21%
• OUT: ~43M pages, ~21%
• TENDRILS: ~44M pages, ~22%
• DISC: ~17M pages, ~8%
• WCC > 91% of the web --- the giant component
• One interpretation of the pieces:
  – SCC: the heart of the web
  – IN: newer sites not yet discovered and linked to
  – OUT: “insular” pages like corporate web sites

Diameter Measurements
• Directed worst-case diameter of the SCC:
  – at least 28
• Directed worst-case diameter of IN → SCC → OUT:
  – at least 503
• Over 75% of the time, there is no directed path between a random start and finish page in the WCC
  – when there is a directed path, average length is 16
• Average undirected distance in the WCC is 7
• Moral:
  – the web is a “small world” when we ignore direction
  – otherwise the picture is more complex

Degree Distributions
• They are, of course, heavy-tailed
• Power law distribution of component size
  – not what the Erdos-Renyi model would predict
• Undirected connectivity of the web is not reliant on “connectors”
  – what happens as we remove high-degree vertices?
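Aside: to make the five-piece (“bow tie”) decomposition above concrete, here is a minimal sketch in Python using the networkx library on a tiny made-up graph. The node names and edges are invented purely for illustration; this is not the AltaVista crawl data or the authors' code.

```python
# Sketch of the five-piece web decomposition on a toy directed graph.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("a", "b"), ("b", "c"), ("c", "a"),   # a small cycle: the SCC {a, b, c}
    ("x", "a"),                           # x reaches the SCC but not vice versa -> IN
    ("c", "y"),                           # y is reachable from the SCC only -> OUT
    ("x", "t"),                           # t hangs off IN, never touches the SCC -> TENDRILS
])
G.add_node("z")                           # isolated page -> DISC

# The largest strongly connected component plays the role of the paper's SCC.
scc = max(nx.strongly_connected_components(G), key=len)
rep = next(iter(scc))                     # any representative page inside the SCC

reach_from_scc = nx.descendants(G, rep) | scc   # everything the SCC can reach
reach_to_scc   = nx.ancestors(G, rep) | scc     # everything that can reach the SCC

IN   = reach_to_scc - scc
OUT  = reach_from_scc - scc
wcc  = next(c for c in nx.weakly_connected_components(G) if rep in c)
TEND = wcc - scc - IN - OUT               # weakly connected, but neither IN nor OUT
DISC = set(G) - wcc                       # not even weakly connected

print("SCC", sorted(scc), "IN", sorted(IN), "OUT", sorted(OUT),
      "TENDRILS", sorted(TEND), "DISC", sorted(DISC))
```

On this toy graph the sketch prints SCC {a, b, c}, IN {x}, OUT {y}, TENDRILS {t}, DISC {z}, mirroring the regions defined above.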
Beyond Macroscopic Structure
• Such studies tell us the coarse overall structure of the web
• Use and construction of the web are more fine-grained
  – people browse the web for certain information or topics
  – people build pages that link to related or “similar” pages
• How do we quantify & analyze this more detailed structure?
• We’ll examine two related examples:
  – Kleinberg’s hubs and authorities
    • automatic identification of “web communities”
  – PageRank
    • automatic identification of “important” pages
    • one of the main criteria used by Google
  – both rely mainly on the link structure of the web
  – both have an algorithm and a theory supporting it

Hubs and Authorities
• Suppose we have a large collection of pages on some topic
  – possibly the results of a standard web search
• Some of these pages are highly relevant, others not at all
• How could we automatically identify the important ones?
• What’s a good definition of importance?
• Kleinberg’s idea: there are two kinds of important pages:
  – authorities: highly relevant pages
  – hubs: pages that point to lots of relevant pages
• If you buy this definition, it further stands to reason that:
  – a good hub should point to lots of good authorities
  – a good authority should be pointed to by many good hubs
  – this logic is, of course, circular
• We need some math and an algorithm to sort it out

The HITS System (Hyperlink-Induced Topic Search)
• Given a user-supplied query Q:
  – assemble root set S of pages (e.g. first 200 pages returned by AltaVista)
  – grow S to base set T by adding all pages linked (in either direction) to S
  – might bound the number of links considered from each page in S
• Now consider the directed subgraph induced on just the pages in T
• For each page p in T, define its
  – hub weight h(p); initialize all to be 1
  – authority weight a(p); initialize all to be 1
• Repeat “forever”:
  – a(p) := sum of h(q) over all pages q → p
  – h(p) := sum of a(q) over all pages p → q
  – renormalize all the weights
• This algorithm will always converge!
  – the weights computed are related to eigenvectors of the connectivity matrix
  – further substructure is revealed by different eigenvectors
• Here are some examples (a short code sketch of the update rule appears after the “Random Surfer” slide below)

The PageRank Algorithm
• Let’s define a measure of page importance we will call the rank
• Notation: for any page p, let
  – N(p) be the number of forward links (pages p points to)
  – R(p) be the (to-be-defined) rank of p
• Idea: important pages distribute importance over their forward links
• So we might try defining
  – R(p) := sum of R(q)/N(q) over all pages q → p
  – can again define an iterative algorithm for computing the R(p)
  – if it converges, the solution again has an eigenvector interpretation
  – problem: cycles accumulate rank but never distribute it
• The fix:
  – R(p) := [sum of R(q)/N(q) over all pages q → p] + E(p)
  – E(p) is some external or exogenous measure of importance
  – some technical details omitted here (e.g. normalization)
• Let’s play with the PageRank calculator

The “Random Surfer” Model
• Let’s suppose that E(p) sums to 1 (normalized)
• Then the resulting PageRank solution R(p) will
  – also be normalized
  – and can be interpreted as a probability distribution
• R(p) is the stationary distribution of the following process:
  – starting from some random page, just keep following random links
  – if stuck in a loop, jump to a random page drawn according to E(p)
  – so the surfer periodically gets “bored” and jumps to a new page
  – E(p) can thus be personalized for each surfer
• An important component of Google’s search criteria
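Aside: here is a minimal sketch of the hub/authority update from the HITS slides above, run on a tiny invented link graph (the page names and links are made up). A real run would build the base set T from search results as described and iterate until the weights stop changing; this only shows the core iteration.

```python
# Sketch of the HITS hub/authority iteration on a toy base set T.
from math import sqrt

# directed links: page -> pages it points to
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2", "a3"],
    "a1": [],
    "a2": ["a1"],
    "a3": [],
}

hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(50):                       # "repeat forever": in practice, until convergence
    # a(p) := sum of h(q) over all pages q -> p
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # h(p) := sum of a(q) over all pages p -> q
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    # renormalize all the weights
    za = sqrt(sum(v * v for v in auth.values())) or 1.0
    zh = sqrt(sum(v * v for v in hub.values())) or 1.0
    auth = {p: v / za for p, v in auth.items()}
    hub = {p: v / zh for p, v in hub.items()}

print("authorities:", {p: round(v, 3) for p, v in auth.items()})
print("hubs:       ", {p: round(v, 3) for p, v in hub.items()})
```

On this toy graph a1 ends up with the largest authority weight (both hubs point to it) and h2 with the largest hub weight (it points to the most authorities).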
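Aside: the slides above omit the normalization details of the PageRank fix. One common way to fill them in is the damped form sketched below, where the surfer follows a random link with probability d and otherwise jumps according to E(p). The toy link graph and the value d = 0.85 are assumptions for illustration, not Google's implementation.

```python
# Sketch of the PageRank iteration with an exogenous jump term E(p).
links = {
    "p1": ["p2", "p3"],
    "p2": ["p3"],
    "p3": ["p1"],
    "p4": ["p3"],          # p4 has no in-links; it still receives rank via E(p)
}
pages = list(links)
d = 0.85                                  # weight on following links vs. jumping
E = {p: 1.0 / len(pages) for p in pages}  # uniform E(p); could be personalized

R = {p: 1.0 / len(pages) for p in pages}
for _ in range(100):
    new_R = {}
    for p in pages:
        # sum of R(q)/N(q) over all pages q -> p
        incoming = sum(R[q] / len(links[q]) for q in pages if p in links[q])
        # damped combination: with no dangling pages, R stays a probability distribution
        new_R[p] = d * incoming + (1 - d) * E[p]
    R = new_R

print({p: round(v, 3) for p, v in R.items()})
```

Because every page in this toy graph has at least one outgoing link, the ranks keep summing to 1 and can be read as the stationary distribution of the random surfer described above.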
But What About Content?
• PageRank and Hubs & Authorities:
  – both are based purely on link structure
  – often applied to a pre-computed set of pages filtered for content
• So how do (say) search engines do this filtering?
• This is the domain of information retrieval

Basics of Information Retrieval
• Represent a document as a “bag of words”:
  – for each word in the English language, count the number of occurrences
  – so d[i] is the number of times the i-th word appears in the document
  – usually ignore common words (the, and, of, etc.)
  – usually do some stemming (e.g. “washed” → “wash”)
  – vectors are very long (~100Ks) but very sparse
  – need some special representation exploiting sparseness
• Note all that we ignore or throw away:
  – the order in which the words appear
  – the grammatical structure of sentences (parsing)
  – the sense in which a word is used
    • firing a gun or firing an employee

Bag of Words Document Comparison
• View documents as vectors in a very high-dimensional space
• Can now import geometry and linear algebra concepts
• Similarity between documents d and e:
  – sum of d[i]*e[i] over all words i
  – may normalize d and e first
  – this is their projection onto each other
• Improve by using TF/IDF weighting of words:
  – term frequency (TF) --- how frequent is the word in this document?
  – inverse document frequency (IDF) --- high for words that appear in few documents, low for words that appear everywhere
  – give high weight to words with high TF and high IDF (frequent here, rare overall)
• Search engines:
  – view the query as just another “document”
  – look for similar documents via the similarity above
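Aside: a minimal sketch of the bag-of-words representation, TF/IDF weighting, and the normalized dot-product similarity described above. The tiny corpus and stopword list are invented for illustration; stemming and sparse storage are omitted.

```python
# Sketch of bag-of-words vectors, TF/IDF weighting, and document similarity.
import math
from collections import Counter

stopwords = {"the", "and", "of", "a", "an", "is", "was", "at"}

def bag_of_words(text):
    # count occurrences of each word, ignoring common words
    return Counter(w for w in text.lower().split() if w not in stopwords)

docs = [
    "the gun was fired at the range",
    "the company fired an employee",
    "the employee left the company",
]
bags = [bag_of_words(d) for d in docs]

def idf(word):
    # words appearing in many documents get low weight
    df = sum(1 for b in bags if word in b)
    return math.log(len(bags) / df) if df else 0.0

def tfidf(bag):
    return {w: tf * idf(w) for w, tf in bag.items()}

def similarity(u, v):
    # normalized dot product: the documents' "projection onto each other"
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values())) or 1.0
    nv = math.sqrt(sum(x * x for x in v.values())) or 1.0
    return dot / (nu * nv)

query = tfidf(bag_of_words("fired employee"))   # the query as "just another document"
for text, bag in zip(docs, bags):
    print(round(similarity(query, tfidf(bag)), 3), text)
```

Treating the query as just another document, the second sentence scores highest for the query "fired employee", since it shares both weighted terms.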