Network Structure and Web Search
Networked Life, CIS 112, Spring 2010
Prof. Michael Kearns

Beyond Macroscopic Structure
• Broder et al. report on the coarse overall structure of the web
• Use and construction of the web are more fine-grained
  – people browse the web for certain information or topics
  – people build pages that link to related or "similar" pages
• How do we quantify & analyze this more detailed structure?
• We'll examine two related examples:
  – Kleinberg's hubs and authorities
    • automatic identification of "web communities"
  – PageRank
    • automatic identification of "important" pages
    • one of the main criteria used by Google
  – both rely mainly on the link structure of the web
  – both have an algorithm and a theory supporting them

Hubs and Authorities
• Suppose we have a large collection of pages on some topic
  – possibly the results of a standard web search
• Some of these pages are highly relevant, others not at all
• How could we automatically identify the important ones?
• What's a good definition of importance?
• Kleinberg's idea: there are two kinds of important pages:
  – authorities: highly relevant pages
  – hubs: pages that point to lots of relevant pages
• If you buy this definition, it further stands to reason that:
  – a good hub should point to lots of good authorities
  – a good authority should be pointed to by many good hubs
  – this logic is, of course, circular
• We need some math and an algorithm to sort it out

The HITS System (Hyperlink-Induced Topic Search)
• Given a user-supplied query Q:
  – assemble a root set S of pages (e.g. the first 200 pages returned by AltaVista)
  – grow S to a base set T by adding all pages linked (in either direction) to pages in S
  – might bound the number of links considered from each page in S
• Now consider the directed subgraph induced on just the pages in T
• For each page p in T, define its
  – hub weight h(p); initialize all to be 1
  – authority weight a(p); initialize all to be 1
• Repeat "forever" (a sketch of this iteration appears after the PageRank slide below):
  – a(p) := sum of h(q) over all pages q → p
  – h(p) := sum of a(q) over all pages p → q
  – renormalize all the weights
• This algorithm will always converge!
  – the weights computed are related to eigenvectors of the connectivity matrix
  – further substructure is revealed by different eigenvectors
• Here are some examples

The PageRank Algorithm
• Let's define a measure of page importance we will call the rank
• Notation: for any page p, let
  – N(p) be the number of forward links (pages p points to)
  – R(p) be the (to-be-defined) rank of p
• Idea: important pages distribute their importance over their forward links
• So we might try defining
  – R(p) := sum of R(q)/N(q) over all pages q → p
  – can again define an iterative algorithm for computing the R(p)
  – if it converges, the solution again has an eigenvector interpretation
  – problem: cycles accumulate rank but never distribute it
• The fix:
  – R(p) := [sum of R(q)/N(q) over all pages q → p] + E(p)
  – E(p) is some external or exogenous measure of importance
  – some technical details omitted here (e.g. normalization)
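A minimal sketch of the hub/authority iteration from the HITS slide above, assuming the base set T is given as a directed adjacency list; the function and variable names are illustrative, not taken from Kleinberg's paper or the course materials.

```python
import math

def hits(links, iterations=50):
    """Hub/authority iteration on the directed subgraph induced on the base set T.

    links: dict mapping each page to the list of pages it points to.
    Returns (hub, authority) score dictionaries.
    """
    pages = set(links) | {q for qs in links.values() for q in qs}
    hub = {p: 1.0 for p in pages}    # h(p) initialized to 1
    auth = {p: 1.0 for p in pages}   # a(p) initialized to 1

    for _ in range(iterations):
        # a(p) := sum of h(q) over all pages q -> p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        # h(p) := sum of a(q) over all pages p -> q
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # renormalize all the weights so they do not blow up
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# tiny example: two hub-like pages pointing at the same two candidate authorities
example = {"hub1": ["siteA", "siteB"], "hub2": ["siteA", "siteB"], "siteA": [], "siteB": []}
h, a = hits(example)
```

The square-root (length-one) renormalization is one common convention; the slide only says to "renormalize all the weights".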
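A similarly minimal sketch of the PageRank iteration with the exogenous term E(p). Writing the fix as a damping factor that mixes the link-following term with E(p) is one common convention assumed here, not something specified on the slide.

```python
def pagerank(links, E=None, damping=0.85, iterations=100):
    """Iterative PageRank: R(p) = (1 - d) * E(p) + d * [sum of R(q)/N(q) over q -> p].

    links: dict mapping each page to the list of pages it points to.
    E:     optional exogenous importance distribution (defaults to uniform).
    """
    pages = set(links) | {q for qs in links.values() for q in qs}
    n = len(pages)
    E = E or {p: 1.0 / n for p in pages}   # uniform exogenous importance
    rank = {p: 1.0 / n for p in pages}     # start from a uniform distribution

    for _ in range(iterations):
        new = {}
        for p in pages:
            # sum of R(q)/N(q) over all pages q that link to p
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if links.get(q) and p in links[q])
            new[p] = (1 - damping) * E[p] + damping * incoming
        # renormalize so the ranks remain a probability distribution
        total = sum(new.values())
        rank = {p: v / total for p, v in new.items()}
    return rank
```

With E(p) normalized to sum to 1, the result can be read as the stationary distribution of the "random surfer" described below: follow a random outgoing link with probability `damping`, otherwise jump to a page drawn from E(p). The value 0.85 is a commonly used choice, not one given in the slides.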
• Let's play with the PageRank calculator
  – example graph with nodes A, B, C, D, E, F, G

The "Random Surfer" Model
• Let's suppose that E(p) sums to 1 (normalized)
• Then the resulting PageRank solution R(p) will
  – also be normalized
  – and can thus be interpreted as a probability distribution
• R(p) is the stationary distribution of the following process:
  – starting from some random page, just keep following random links
  – if stuck in a loop, jump to a random page drawn according to E(p)
  – so the surfer periodically gets "bored" and jumps to a new page
  – E(p) can thus be personalized for each surfer
• An important component of Google's search criteria

But What About Content?
• PageRank and Hubs & Authorities are
  – both based purely on link structure
  – often applied to a pre-computed set of pages filtered for content
• So how do (say) search engines do this filtering?
• This is the domain of information retrieval

Basics of Information Retrieval
• Represent a document as a "bag of words":
  – for each word in the English language, count the number of occurrences
  – so d[i] is the number of times the i-th word appears in the document
  – usually ignore common words (the, and, of, etc.)
  – usually do some stemming (e.g. "washed" → "wash")
  – vectors are very long (~100Ks of possible words) but very sparse
  – need some special representation exploiting sparseness
• Note all that we ignore or throw away:
  – the order in which the words appear
  – the grammatical structure of sentences (parsing)
  – the sense in which a word is used
    • firing a gun vs. firing an employee
  – and much, much more…

Bag of Words Document Comparison
• View documents as vectors in a very high-dimensional space
• Can now import geometry and linear algebra concepts
• Similarity between documents d and e:
  – Σ d[i]*e[i] over all words i
  – may normalize d and e first
  – this is their projection onto each other
• Improve by using TF/IDF weighting of words:
  – term frequency (TF): how frequent is the word in this document?
  – inverse document frequency (IDF): inversely related to how many documents contain the word
  – give high weight to words with high TF that are rare across the collection (high IDF)
• Search engines:
  – view the query as just another "document"
  – look for similar documents via the above (see the sketch at the end of these notes)

Looking Ahead: Left Side vs. Right Side
• So far we have been discussing the "left hand" search results on Google
  – a.k.a. "organic" search
• "Right hand" or "sponsored" search: paid advertisements in a formal market
  – we will spend a lecture on these markets later in the term
• Same two types of search/results on Yahoo!, MSN, …
• Common perception:
  – organic results are "objective", based on content, importance, etc.
  – sponsored results are subjective advertisements
• But both sides are subject to "gaming" (strategic behavior)…
  – organic: invisible terms in the html, link farms and web spam, reverse engineering
  – sponsored: bidding behavior, "jamming"
  – optimization of each side has its own industry: SEO and SEM
• … and perhaps to outright fraud
  – organic: typo squatting
  – sponsored: click fraud
• More later…
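Finally, returning to the bag-of-words comparison from the information retrieval slides: a minimal sketch of TF/IDF weighting and normalized dot-product similarity, treating the query as just another document. The tokenization, stopword list, and exact IDF formula are illustrative assumptions, not the course's definitions.

```python
import math
from collections import Counter

STOPWORDS = frozenset({"the", "and", "of", "a", "to"})

def bag_of_words(text):
    """Sparse word-count vector for a document, ignoring a few common words."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def tfidf(doc, docs):
    """Weight each term by term frequency times inverse document frequency."""
    n = len(docs)
    return {w: tf * math.log(n / sum(1 for d in docs if w in d))
            for w, tf in doc.items()}

def similarity(u, v):
    """Normalized dot product (projection) of two sparse vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

docs = [bag_of_words(t) for t in ["the dog washed the car",
                                  "firing a gun",
                                  "firing an employee"]]
weighted = [tfidf(d, docs) for d in docs]
query = tfidf(bag_of_words("firing employee"), docs)  # the query as just another document
scores = [similarity(query, d) for d in weighted]     # rank documents by similarity to the query
```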