Authoritative Sources in a Hyperlinked Environment
By: Jon M. Kleinberg
Presented by: Yemin Shi
CS-572, June 30, 2011

Ranking of Search Results
• Modern search engines may return millions of pages for a single query.
• That is far too many for a human user to preview, so we need a method that filters out a small set of the most authoritative results.
• A ranking method processes the query results and puts the most useful information at the top of the list.
• Link-based methods focus on the way pages reference one another and provide an efficient way to filter the authoritative results.

Queries
• Specific queries, e.g. “What does Dr. Chris Mattmann think of the presentations between 3:30 and 5:00 PM PDT, June 30, 2011?”: very few pages contain the answer, and it is difficult to determine the identity of these pages.
• Broad-topic queries, e.g. “java”: too many pages match, and it is difficult for a traditional text-based search engine to find the authoritative pages.
• Similar-page queries, e.g. “find pages similar to java”: similar to broad-topic queries.

Related to Class Material
• HITS stands for Hypertext Induced Topic Search.
• HITS was a pioneering link-based ranking method and is one of the major web ranking models mentioned in class.
• This presentation goes into the details of how to calculate “authority” and “hub” pages, which were mentioned in class.
• We will compare HITS with the other link-based algorithm, PageRank.
• We will evaluate the pros and cons of the paper.

Outline
• Link-based algorithms
• HITS algorithm
  • Constructing a Focused Subgraph of the WWW
  • Computing Hubs and Authorities
  • Comparison with PageRank
• Expansions
  • Similar-Page Queries (modification)
  • Social Network/Scientific Citation
  • Multiple Sets of Hubs and Authorities
  • Diffusion and generalization
• Evaluation
  • Pros and Cons of the paper

Link-based ranking algorithm
Challenges of text-based ranking:
• www.harvard.edu is the most authoritative page for the query “harvard”. However, other pages may contain the keyword “harvard” more often.
• Pages are not sufficiently self-descriptive, e.g. the
  query “search engine”: Google does not use the term on its own pages.
• The number of returned pages is too large to preview.

Link-based ranking algorithm
• Links encode latent human judgment.
• By including a link to page q, the creator of page p has in some measure conferred authority on q.
• Pages do not need to be self-descriptive.
• The authority criterion must balance relevance and popularity: for an automobile query, the VW, Benz, and BMW pages are on topic, whereas www.yahoo.com has a very large in-degree but lacks thematic unity.

Link-based ranking algorithm
• Authority: an authority is a page with many in-links. The page may have good or authoritative content on some topic, so many people trust it and link to it.
• Hub: a hub is a page with many out-links. The page serves as an organizer of the information on a particular topic and points to many good authority pages on that topic.

Link-based ranking algorithm
PageRank (Brin & Page 1998):
• Computed for all web pages before any query is issued (query independent).
• Computes authority only.
• Fast to compute.
HITS:
• Performed on the set of pages retrieved for each query (query dependent).
• Computes both authorities and hubs.
• Needs more calculation, so it is slow for real-time queries.

HITS Algorithm
Step 1: Constructing a Focused Subgraph of the WWW
Requirements:
1. Sq (the collection of pages with respect to query q) is small.
2. Sq is rich in relevant pages.
3. Sq contains most of the strongest authorities.

Subgraph(q, E, t, d)
  q: a query string
  E: a text-based search engine   /* Narrow down: results from AltaVista */
  Let Rq denote the top t results of E on q.
  Set Sq := Rq
  For each page p in Rq:   /* Expanding */
    Add all pages that p points to into Sq.
    Add all pages that point to p into Sq. (If there are more than d such pages, randomly select d of them and add those to Sq.)
    /* Limit: a single page can bring in at most d pages that point to it.
       Otherwise, the expansion could pull in hundreds of thousands of extra pages. */
  /* Remove intrinsic links (used for website navigation). Anti-collusion: allow at most m pages from a single domain to point to any given page. */
  Return Sq

HITS Algorithm
Step 2: Computing Hubs and Authorities
Rules:
1. A good hub points to many good authorities.
2. A good authority is pointed to by many good hubs.
3. Authorities and hubs have a mutually reinforcing relationship.
Let the authority score of page i be x(i) and its hub score be y(i). The mutually reinforcing relationship is:
I step: $x(i) \leftarrow \sum_{(j,i) \in E} y(j)$
O step: $y(i) \leftarrow \sum_{(i,j) \in E} x(j)$

HITS Algorithm
[Figure: pages 2, 3, and 4 point to page 1, so x(1) = y(2) + y(3) + y(4); page 1 points to pages 5, 6, and 7, so y(1) = x(5) + x(6) + x(7).]

HITS Algorithm
In matrix form:
$x(i) \leftarrow \sum_{(j,i) \in E} y(j)$, i.e. $x \leftarrow A^T y$
$y(i) \leftarrow \sum_{(i,j) \in E} x(j)$, i.e. $y \leftarrow A x$
$x_k \propto (A^T A)^{k-1} A^T z$ (normalized to a unit vector)
$y_k \propto (A A^T)^k z$ (normalized to a unit vector)
where A is the adjacency matrix of the subgraph G (Sq for query q) and z is the initial all-ones vector.

HITS Algorithm
Recap: if A is a square matrix, a non-zero vector v is an eigenvector of A if there is a scalar λ such that Av = λv.
Applied to the iteration above: as k grows, x_k converges to the principal eigenvector of $A^T A$ and y_k converges to the principal eigenvector of $A A^T$.

HITS Algorithm
The Iterate(G, k) procedure can be applied to filter out the top c authorities and the top c hubs.

HITS Results
• www.roadahead.com is ranked 123rd by AltaVista.
• Text-based search alone misses the authorities.
• Text-based search combined with link analysis works.
• The authorities do not contain many occurrences of the query string “Gates”.

Related work
• Similar-page queries: instead of “find t pages containing the string q”, use “find t pages pointing to p”. Example: Honda leads to Ford, Toyota, etc.
• Social networks: measure of standing by path counting (Katz).
• Scientific citations.

Multiple Sets of Hubs and Authorities
• The same query string can correspond to different meanings.

Highlights of the method
• Developed a set of algorithmic tools for extracting information from the link structure of hyperlinked environments.
• Formulated the notion of authority based on the relationship between a set of “authority” pages and a set of “hub” pages.
• Proposed a heuristic algorithm to find these pages.
• Surveyed variants and applications.

Evaluation: HITS vs PageRank
• Eigengap: the difference between the largest and second-largest eigenvalues of the matrix M.
• Ng et al. (2001) compared the stability of the two algorithms.
• Data: the Cora database, a collection containing citation information (similar to links) from several thousand papers in AI.
• Idea: if an article is truly authoritative or influential, then adding a few links or citations should not change our minds about how influential it is.
• Based on this idea, Ng et al. constructed a set of five perturbed databases, in each of which 30% of the papers from the base set were randomly deleted.

Evaluation: HITS vs PageRank
[Figure panels: HITS, PageRank]

Evaluation: HITS vs PageRank
• The eigenvectors of the matrices are indicated by the directions of the principal axes of the ellipses.
• A small perturbation causes a 45 degree change when the eigengap is small, and essentially no change when the eigengap is large.

Evaluation: Pros
• Creative idea of splitting the authority concept into “Authority” and “Hub”, especially for 1998.
• Efficient heuristic algorithm to solve for the authority weights and hub weights.
• Query-driven dynamic ranking.
• Solid theoretical background.
• Abundant variants and applications.

Evaluation: Cons
• The convergence is not as robust as PageRank's when the link structure is perturbed.
• Topic drift.
• Inefficient at run time.
• User behavior information is not integrated.

Reference
• J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46 (1999). Also appears as IBM Research Report RJ 10076, May 1997.
• A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for link analysis.
  Proceedings of the 24th International Conference on Research and Development in Information Retrieval (SIGIR). New York, NY: ACM Press, 2001.
• Wikipedia: www.wikipedia.org

Questions?
Thanks for your time!
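
Backup: Iterate(G, k) sketch
The I and O steps from the "Computing Hubs and Authorities" slides can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the paper's implementation: the names `EDGES`, `normalize`, `hits`, and `top`, and the default k = 20, are hypothetical choices made for this slide. The edge list encodes the 7-page example from the figure (pages 2, 3, 4 point to page 1, which points to pages 5, 6, 7), and scores start from the all-ones vector z.

```python
import math

# Edges (j, i) mean "page j links to page i". This is the 7-page example
# from the slides; the variable name EDGES is an illustrative assumption.
EDGES = [(2, 1), (3, 1), (4, 1), (1, 5), (1, 6), (1, 7)]


def normalize(scores):
    """Scale a score dict to unit Euclidean length (unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(edges, k=20):
    """Run k rounds of the I and O steps; return (authority, hub) dicts."""
    pages = sorted({p for edge in edges for p in edge})
    x = {p: 1.0 for p in pages}  # authority scores, starting from z = (1, ..., 1)
    y = {p: 1.0 for p in pages}  # hub scores, starting from z = (1, ..., 1)
    for _ in range(k):
        # I step: x(i) <- sum of y(j) over all edges (j, i)
        new_x = {p: 0.0 for p in pages}
        for j, i in edges:
            new_x[i] += y[j]
        # O step: y(i) <- sum of x(j) over all edges (i, j), using the new x
        new_y = {p: 0.0 for p in pages}
        for i, j in edges:
            new_y[i] += new_x[j]
        # Normalize both score vectors to unit length
        x, y = normalize(new_x), normalize(new_y)
    return x, y


def top(scores, c):
    """Return the c pages with the largest scores (ties broken by page id)."""
    return sorted(scores, key=lambda p: (-scores[p], p))[:c]


authority, hub = hits(EDGES)
print("top authorities:", top(authority, 3))
print("top hubs:", top(hub, 3))
```

On this toy graph page 1 receives the largest authority score, and pages 5, 6, and 7 get a hub score of zero because they have no out-links. Plain dictionaries are used instead of an adjacency matrix only to keep the sketch dependency-free; the updates are the same as x ← Aᵀy and y ← Ax.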