Network Structure and Web Search
Networked Life, CIS 112, Spring 2010
Prof. Michael Kearns

Beyond Macroscopic Structure
• Broder et al. report on the coarse overall structure of the web
• Use and construction of the web are more fine-grained
  – people browse the web for certain information or topics
  – people build pages that link to related or "similar" pages
• How do we quantify & analyze this more detailed structure?
• We'll examine two related examples:
  – Kleinberg's hubs and authorities
    • automatic identification of "web communities"
  – PageRank
    • automatic identification of "important" pages
    • one of the main criteria used by Google
  – both rely mainly on the link structure of the web
  – both have an algorithm and a theory supporting them

Hubs and Authorities
• Suppose we have a large collection of pages on some topic
  – possibly the results of a standard web search
• Some of these pages are highly relevant, others not at all
• How could we automatically identify the important ones?
• What's a good definition of importance?
• Kleinberg's idea: there are two kinds of important pages:
  – authorities: highly relevant pages
  – hubs: pages that point to lots of relevant pages
• If you buy this definition, it further stands to reason that:
  – a good hub should point to lots of good authorities
  – a good authority should be pointed to by many good hubs
  – this logic is, of course, circular
• We need some math and an algorithm to sort it out

The HITS System (Hyperlink-Induced Topic Search)
• Given a user-supplied query Q:
  – assemble a root set S of pages (e.g. the first 200 pages returned by AltaVista)
  – grow S to a base set T by adding all pages linked (in either direction) to pages in S
  – might bound the number of links considered from each page in S
• Now consider the directed subgraph induced on just the pages in T
• For each page p in T, define its
  – hub weight h(p); initialize all to be 1
  – authority weight a(p); initialize all to be 1
• Repeat "forever" (a sketch of this iteration appears after the PageRank slide below):
  – a(p) := sum of h(q) over all pages q → p
  – h(p) := sum of a(q) over all pages p → q
  – renormalize all the weights
• This algorithm will always converge!
  – the weights computed are related to eigenvectors of the connectivity matrix
  – further substructure is revealed by different eigenvectors
• Here are some examples

The PageRank Algorithm
• Let's define a measure of page importance we will call the rank
• Notation: for any page p, let
  – N(p) be the number of forward links (pages p points to)
  – R(p) be the (to-be-defined) rank of p
• Idea: important pages distribute their importance over their forward links
• So we might try defining
  – R(p) := sum of R(q)/N(q) over all pages q → p
  – can again define an iterative algorithm for computing the R(p)
  – if it converges, the solution again has an eigenvector interpretation
  – problem: cycles accumulate rank but never distribute it
• The fix:
  – R(p) := [sum of R(q)/N(q) over all pages q → p] + E(p)
  – E(p) is some external or exogenous measure of importance
  – some technical details omitted here (e.g. normalization)
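A minimal sketch of the hub/authority iteration from the HITS slide above, assuming the base set T is given as a directed adjacency list; the function and variable names are illustrative, not taken from Kleinberg's paper or the course materials.

```python
import math

def hits(links, iterations=50):
    """Hub/authority iteration on the directed subgraph induced on the base set T.

    links: dict mapping each page to the list of pages it points to.
    Returns (hub, authority) score dictionaries.
    """
    pages = set(links) | {q for qs in links.values() for q in qs}
    hub = {p: 1.0 for p in pages}    # h(p) initialized to 1
    auth = {p: 1.0 for p in pages}   # a(p) initialized to 1

    for _ in range(iterations):
        # a(p) := sum of h(q) over all pages q -> p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        # h(p) := sum of a(q) over all pages p -> q
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # renormalize all the weights so they do not blow up
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# tiny example: two hub-like pages pointing at the same two candidate authorities
example = {"hub1": ["siteA", "siteB"], "hub2": ["siteA", "siteB"], "siteA": [], "siteB": []}
h, a = hits(example)
```

The square-root (length-one) renormalization is one common convention; the slide only says to "renormalize all the weights".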
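A similarly minimal sketch of the PageRank iteration with the exogenous term E(p). Writing the fix as a damping factor that mixes the link-following term with E(p) is one common convention assumed here, not something specified on the slide.

```python
def pagerank(links, E=None, damping=0.85, iterations=100):
    """Iterative PageRank: R(p) = (1 - d) * E(p) + d * [sum of R(q)/N(q) over q -> p].

    links: dict mapping each page to the list of pages it points to.
    E:     optional exogenous importance distribution (defaults to uniform).
    """
    pages = set(links) | {q for qs in links.values() for q in qs}
    n = len(pages)
    E = E or {p: 1.0 / n for p in pages}   # uniform exogenous importance
    rank = {p: 1.0 / n for p in pages}     # start from a uniform distribution

    for _ in range(iterations):
        new = {}
        for p in pages:
            # sum of R(q)/N(q) over all pages q that link to p
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if links.get(q) and p in links[q])
            new[p] = (1 - damping) * E[p] + damping * incoming
        # renormalize so the ranks remain a probability distribution
        total = sum(new.values())
        rank = {p: v / total for p, v in new.items()}
    return rank
```

With E(p) normalized to sum to 1, the result can be read as the stationary distribution of the "random surfer" described below: follow a random outgoing link with probability `damping`, otherwise jump to a page drawn from E(p). The value 0.85 is a commonly used choice, not one given in the slides.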
• Let's play with the PageRank calculator
  – example graph with nodes A, B, C, D, E, F, G

The "Random Surfer" Model
• Let's suppose that E(p) sums to 1 (normalized)
• Then the resulting PageRank solution R(p) will
  – also be normalized
  – and can thus be interpreted as a probability distribution
• R(p) is the stationary distribution of the following process:
  – starting from some random page, just keep following random links
  – if stuck in a loop, jump to a random page drawn according to E(p)
  – so the surfer periodically gets "bored" and jumps to a new page
  – E(p) can thus be personalized for each surfer
• An important component of Google's search criteria

But What About Content?
• PageRank and Hubs & Authorities are
  – both based purely on link structure
  – often applied to a pre-computed set of pages filtered for content
• So how do (say) search engines do this filtering?
• This is the domain of information retrieval

Basics of Information Retrieval
• Represent a document as a "bag of words":
  – for each word in the English language, count the number of occurrences
  – so d[i] is the number of times the i-th word appears in the document
  – usually ignore common words (the, and, of, etc.)
  – usually do some stemming (e.g. "washed" → "wash")
  – vectors are very long (~100Ks of possible words) but very sparse
  – need some special representation exploiting sparseness
• Note all that we ignore or throw away:
  – the order in which the words appear
  – the grammatical structure of sentences (parsing)
  – the sense in which a word is used
    • firing a gun vs. firing an employee
  – and much, much more…

Bag of Words Document Comparison
• View documents as vectors in a very high-dimensional space
• Can now import geometry and linear algebra concepts
• Similarity between documents d and e:
  – Σ d[i]*e[i] over all words i
  – may normalize d and e first
  – this is their projection onto each other
• Improve by using TF/IDF weighting of words:
  – term frequency (TF): how frequent is the word in this document?
  – inverse document frequency (IDF): inversely related to how many documents contain the word
  – give high weight to words with high TF that are rare across the collection (high IDF)
• Search engines:
  – view the query as just another "document"
  – look for similar documents via the above (see the sketch at the end of these notes)

Looking Ahead: Left Side vs. Right Side
• So far we have been discussing the "left hand" search results on Google
  – a.k.a. "organic" search
• "Right hand" or "sponsored" search: paid advertisements in a formal market
  – we will spend a lecture on these markets later in the term
• Same two types of search/results on Yahoo!, MSN, …
• Common perception:
  – organic results are "objective", based on content, importance, etc.
  – sponsored results are subjective advertisements
• But both sides are subject to "gaming" (strategic behavior)…
  – organic: invisible terms in the html, link farms and web spam, reverse engineering
  – sponsored: bidding behavior, "jamming"
  – optimization of each side has its own industry: SEO and SEM
• … and perhaps to outright fraud
  – organic: typo squatting
  – sponsored: click fraud
• More later…
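Finally, returning to the bag-of-words comparison from the information retrieval slides: a minimal sketch of TF/IDF weighting and normalized dot-product similarity, treating the query as just another document. The tokenization, stopword list, and exact IDF formula are illustrative assumptions, not the course's definitions.

```python
import math
from collections import Counter

STOPWORDS = frozenset({"the", "and", "of", "a", "to"})

def bag_of_words(text):
    """Sparse word-count vector for a document, ignoring a few common words."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def tfidf(doc, docs):
    """Weight each term by term frequency times inverse document frequency."""
    n = len(docs)
    return {w: tf * math.log(n / sum(1 for d in docs if w in d))
            for w, tf in doc.items()}

def similarity(u, v):
    """Normalized dot product (projection) of two sparse vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

docs = [bag_of_words(t) for t in ["the dog washed the car",
                                  "firing a gun",
                                  "firing an employee"]]
weighted = [tfidf(d, docs) for d in docs]
query = tfidf(bag_of_words("firing employee"), docs)  # the query as just another document
scores = [similarity(query, d) for d in weighted]     # rank documents by similarity to the query
```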