CONTENT AND LINK ANALYSIS FOR SEARCHING THE WEB
Kemal Efe, Vijay Raghavan, and Arun Lakhotia
University of Louisiana
Presented by Lan Nie, 09/01/2005, Lehigh University

Introduction
  - Search engine: crawls, indexes, and retrieves information about web pages.
      - Find all of the relevant pages
      - Rank them by relevance to the user query
      - Present a rank-ordered result
  - Recall and precision
  - Early search engines
      - Relied solely on keyword matching
      - Returned many low-quality pages; rankings rarely agreed with users' interests
      - Suffered from synonymy and polysemy
  - Modern search engines
      - Linkage structure provides valuable information
      - Link analysis is combined with content analysis
      - Search quality improves substantially

Link Analysis
  - Authority flow model
      - A link is a channel for authority flow.
      - A page $q$ with authority rank $r_q(i)$ at iteration $i$ distributes all of its current authority equally among its outgoing links:

        \[ r_p(i+1) = \sum_{q:\, q \to p} \frac{r_q(i)}{F(q)} \]

        where $F(q)$ is the number of outgoing links of $q$.
      - However, an authoritative page on a subject is likely to be co-cited with other authoritative pages on the same subject.
      - The rank of a page should therefore be augmented by its co-citation degree.
  - Random walk model
      - A surfer walks on the web graph and makes random decisions about where to go next.
  - PageRank is a combination of the authority flow model and the random walk model.

Authority and Hub
  - Let $A$ be the adjacency matrix of the web graph ($A_{pq} = 1$ if page $p$ links to page $q$).
  - Co-citation matrix $A^T A$
      - Entry $(p, q)$: the number of joint co-citations received by $p$ and $q$
      - Entry $(p, p)$: the in-degree of page $p$
  - Bibliographic coupling matrix $A A^T$
  - Authority / hub iteration:

        \[ a[i+1] = (A^T A)\, a[i] \]
        \[ h[i+1] = (A A^T)\, h[i] \]

      - Diagonal terms: authority is influenced by the number of citations.
      - Off-diagonal terms: authority is influenced by the degree of co-citation.
      - Influence (co-citation) >> Influence (citation)
  - A more general model uses different weights for the diagonal and off-diagonal terms in the computation above.
  - The HITS algorithm combines the authority and hub ideas.

Content Analysis
  - Which pages are important in the Web graph? (Link Analysis)
  - Which pages are relevant to the query?
    (Content Analysis)

Tasks of Content Analysis
  - How relevant a page is to the user query
      - Similarity between documents in vector space: cosine similarity
      - Okapi measure, three-level scoring, cover density ranking
  - Where on the page to search for the query terms
      - Fields: title, anchor text, abstract
      - Properties: font, highlighting, capitalization, distance between subquery terms
  - Dealing with synonymy and polysemy
      - LSI, GVSM
      - Applications in classification, document search, and relevance ranking

Combining Content and Link Analysis (PageRank)
  - PageRank: a random surfer (Brin and Page [1998])
      - With probability $1-d$ the surfer jumps to a random page; with probability $d$ it follows a random outlink.
      - The rank is independent of the query and the topic.

        \[ PR(p) = d \sum_{q:\, q \to p} \frac{PR(q)}{F(q)} + (1-d)\,\frac{1}{N} \]

        where $N$ is the total number of pages.
  - Topic-Sensitive PageRank: multiple focused surfers (Haveliwala [2002])
      - A set of predefined topics (the top-level categories of the ODP), with $C_t$ the set of URLs in ODP category $t$.
      - Each page is assigned a rank vector, with one rank for each topic.
      - Each surfer is focused on a specific topic $t$.
      - With probability $1-d$ the surfer jumps to a page in $C_t$; with probability $d$ it follows a random outlink.

        \[ TSPR_t(p) = d \sum_{q:\, q \to p} \frac{TSPR_t(q)}{F(q)} + (1-d) \cdot \begin{cases} \dfrac{1}{|C_t|} & \text{if } p \in C_t \\ 0 & \text{if } p \notin C_t \end{cases} \]

      - For a given query, a page's query-sensitive score is the inner product of the page's rank vector and the query's topic distribution vector.

Combining Content and Link Analysis (HITS)
  - HITS
      - Sampling
          - Use the query to collect a root set of pages from a textual search engine.
          - Expand the root set into a base set by adding pages that link to, or are linked from, the root set.
      - Calculation of authorities and hubs:

        \[ h = A a = A A^T h \]
        \[ a = A^T h = A^T A a \]

  - Problems
      - Tightly Knit Community (TKC) effect: HITS converges to the regions of the web graph that are most highly connected. What if the tightly knit community is irrelevant to the topic?
      - A page propagates the same authority weight along each of its outgoing links.
      - The result is dominated by one community; a page is deemed unimportant if it is popular only within a smaller community.
      - Example: Jaguar
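
The PageRank and HITS updates on the preceding slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used by any search engine: the tiny graph `web`, the damping value `d = 0.85`, the fixed iteration counts, and the L2 normalization in `hits` are all choices made here and do not come from the slides. The sketch also assumes every page has at least one outlink, so no rank is lost at dangling pages.

```python
import math


def pagerank(links, d=0.85, iters=100):
    """PR(p) = d * sum_{q -> p} PR(q) / F(q) + (1 - d) / N.

    links maps each page to the list of pages it links to; every page
    must appear as a key (assumption: each page has >= 1 outlink).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1.0 - d) / n for p in pages}  # random-jump term
        for q, outs in links.items():
            for p in outs:
                nxt[p] += d * rank[q] / len(outs)  # q shares its rank equally
        rank = nxt
    return rank


def _normalize(scores):
    # L2 normalization keeps the scores bounded (an added assumption).
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: (v / norm if norm else 0.0) for p, v in scores.items()}


def hits(links, iters=50):
    """Alternate a = A^T h and h = A a, normalizing after each round."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        new_auth = {p: 0.0 for p in pages}
        for q, outs in links.items():
            for p in outs:
                new_auth[p] += hub[q]  # a(p) = sum of hub scores of pages citing p
        new_hub = {q: sum(new_auth[p] for p in outs) for q, outs in links.items()}
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return auth, hub


# Hypothetical four-page graph: "c" is cited by three pages, "a" cites two.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
auth, hub = hits(web)
print(sorted(ranks, key=ranks.get, reverse=True))
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

On this graph the authority scores concentrate on the densely linked pages "b" and "c", while the score of "a" decays toward zero even though it has the same in-degree as "b". That is a small-scale view of the TKC effect noted above.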
Improvement of HITS
  - Chakrabarti et al. [1999]
      - Outlinks in different parts of a page may point to different topics.
      - Page splitting: outlinks within each smaller page tend to be on the same topic.
  - Li et al. [2002]
      - A good hub is likely to be cited, so the hub weights of pages are increased depending on their authority weights.
  - Cohn and Chang [2001]
      - PHITS: a probabilistic model that ranks a page within its own community rather than within the entire base set.
  - Dean et al. [1999]: What's Related?
      - Given the seed page, find its parents, the children of its parents, its children, and the parents of its children.
      - Given a seed page, find the pages that link to it and what else they link to.
      - Output the pages that are most frequently co-cited with the seed URL.
  - Bharat and Henzinger [1998], Chakrabarti et al. [1998]
      - Weighted HITS

Weighted HITS
  - CLEVER project (Chakrabarti et al. [1998])
      - A relevance weight is computed for each link.
      - $w(p, q)$: the number of query matches in the text surrounding the link $p \to q$.

        \[ a(p) = \sum_{q:\, q \to p} \bigl(1 + w(q, p)\bigr)\, h(q) \]
        \[ h(p) = \sum_{q:\, p \to q} \bigl(1 + w(p, q)\bigr)\, a(q) \]

  - Query expansion (Bharat and Henzinger [1998])
      - A relevance weight is assigned to each page.
      - Broader query $Q$: the concatenation of the first 1000 words from each document in the root set.
      - $w(p)$: the cosine similarity between page $p$ and the broader query $Q$.

        \[ a(p) = w(p) \sum_{q:\, q \to p} h(q) \]
        \[ h(p) = w(p) \sum_{q:\, p \to q} a(q) \]
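
The two weighted updates above differ only in where the weight enters: on each link (CLEVER) or on each page (query expansion). The Python sketch below illustrates both on a toy "jaguar" graph. Everything concrete in it is hypothetical: the page names, the link weights, the page texts, the string standing in for the broader query $Q$, and the helper names `link_weighted_hits`, `page_weighted_hits`, and `cosine`. The L2 normalization and the plain term-count cosine (no stemming, no TF-IDF) are simplifying assumptions.

```python
import math
from collections import Counter


def _normalize(scores):
    # L2 normalization keeps scores bounded (an added assumption).
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: (v / norm if norm else 0.0) for p, v in scores.items()}


def _hits_core(links, edge_factor, page_factor, iters):
    """a(p) = page_factor(p) * sum_{q -> p} edge_factor(q, p) * h(q), and
    h(p) = page_factor(p) * sum_{p -> q} edge_factor(p, q) * a(q)."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        new_auth = {p: 0.0 for p in pages}
        for q, outs in links.items():
            for p in outs:
                new_auth[p] += edge_factor(q, p) * hub[q]
        new_auth = {p: page_factor(p) * v for p, v in new_auth.items()}
        new_hub = {
            p: page_factor(p) * sum(edge_factor(p, q) * new_auth[q] for q in outs)
            for p, outs in links.items()
        }
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return auth, hub


def link_weighted_hits(links, w, iters=50):
    # CLEVER-style: each link p -> q contributes with factor 1 + w(p, q).
    return _hits_core(links, lambda s, t: 1.0 + w.get((s, t), 0), lambda p: 1.0, iters)


def page_weighted_hits(links, w, iters=50):
    # Query-expansion style: each page's score is scaled by its relevance w(p).
    return _hits_core(links, lambda s, t: 1.0, lambda p: w[p], iters)


def cosine(text_a, text_b):
    # Cosine similarity over raw term counts of whitespace-split tokens.
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical base set: three hubs, one car page, one animal page.
links = {"h1": ["car", "cat"], "h2": ["car", "cat"], "h3": ["cat"], "car": [], "cat": []}

# Plain HITS (all link weights zero): "cat" wins on in-degree alone.
base_auth, _ = link_weighted_hits(links, {})

# Hypothetical counts of query matches near the links pointing to "car".
lw_auth, _ = link_weighted_hits(links, {("h1", "car"): 3, ("h2", "car"): 3})

# Hypothetical page texts and a short stand-in for the broader query Q.
texts = {"h1": "jaguar links", "h2": "jaguar links", "h3": "jaguar links",
         "car": "jaguar car dealer", "cat": "jaguar cat habitat"}
broad_query = "jaguar car price"
page_w = {p: cosine(t, broad_query) for p, t in texts.items()}
pw_auth, _ = page_weighted_hits(links, page_w)

print(max(base_auth, key=base_auth.get))
print(max(lw_auth, key=lw_auth.get), max(pw_auth, key=pw_auth.get))
```

In this toy setting the unweighted run ranks the animal page first because it has more in-links, while both weighted variants move the car page to the top. This is the kind of topic drift correction the weighting schemes are meant to provide.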