Authoritative Sources in a Hyperlinked Environment

By: Jon M. Kleinberg
Presented by: Yemin Shi
CS-572 June 30 2011
Ranking search results




Modern search engines may return millions of pages for a single query. That is far too many for a human user to preview, so we need a method to filter out a small set of the most authoritative results.
A ranking method processes the query results and puts the most useful information at the top of the list.
Link-based methods focus on the way pages reference one another, and provide an efficient way to filter out the authoritative results.
Queries:



Specific queries. E.g. “What does Dr. Chris Mattmann think of the presentations between 3:30 and 5:00 PM PDT, June 30 2011?” – very few pages contain the answer, and it is difficult to determine the identity of these pages.
Broad-topic queries. E.g. “java” – too many pages; it is difficult for a traditional text-based search engine to find the authority pages.
Similar-page queries. E.g. “find pages similar to java” – similar to broad-topic queries.
Related to Class material





HITS stands for Hypertext Induced Topic Search.
HITS pioneered link-based ranking and is one of the major web ranking models mentioned in class.
This presentation goes into the details of how to calculate the “authority” and “hub” pages mentioned in class.
We will compare HITS with the other link-based algorithm: PageRank.
We will evaluate the pros and cons of the paper.
Outline

Link-based algorithms
HITS algorithm
  Constructing a Focused Subgraph of the WWW
  Computing Hubs and Authorities
  Comparison with PageRank
Expansions
  Similar-Page Queries (modification)
  Social Network/Scientific Citation
  Multiple Sets of Hubs and Authorities
  Diffusion and generalization
Evaluation
  Pros and Cons of the paper
Link-based ranking algorithm

Challenges of text-based ranking

www.harvard.edu is the most authoritative page for the query “harvard”. However, other pages may contain the keyword “harvard” more often.
Pages are not sufficiently self-descriptive: e.g. for the query “search engine”, Google does not use the term on its own pages.
The number of pages is too large to preview.
Link-based ranking algorithm

Links encode latent human judgment

The creator of a page p, by including a link to page q, has in some measure conferred authority on q.
Pages do not need to be self-descriptive.
The authority criterion must balance relevance and popularity (automobile → VW, Benz, BMW web pages, but also www.yahoo.com, which has a large in-degree yet lacks thematic unity).
Link-based ranking algorithm

Authority: An authority is a page with many in-links.

The page may have good or authoritative content on some topic, so many people trust it and link to it.

Hub: A hub is a page with many out-links.

The page serves as an organizer of the information on a particular topic and points to many good authority pages on the topic.
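
As a first approximation, the two definitions above can be read as in-degree and out-degree counts. The following is a minimal Python sketch on a hypothetical edge list; the page names are invented for illustration and do not come from the paper.

```python
from collections import Counter

# Hypothetical link graph: each pair (p, q) means page p links to page q.
edges = [
    ("hub1", "a"), ("hub1", "b"), ("hub1", "c"),
    ("hub2", "a"), ("hub2", "b"),
    ("c", "a"),
]

in_degree = Counter(q for _, q in edges)    # many in-links: candidate authority
out_degree = Counter(p for p, _ in edges)   # many out-links: candidate hub

top_authority = in_degree.most_common(1)[0]
top_hub = out_degree.most_common(1)[0]
print(top_authority, top_hub)
```

Raw degree counts cannot tell a topical authority from a merely popular page, which is why HITS refines them with the mutual reinforcement between hubs and authorities.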
Link-based ranking algorithm

PageRank (Brin & Page 1998):
Computed for all web pages before the query (query independent).
Computes authority only.
Fast to compute.

HITS:
Performed on the set of retrieved web pages for each query (query dependent).
Computes authorities and hubs.
More calculation needed; slow for real-time queries.
HITS Algorithm

Step 1: Constructing a Focused Subgraph of the WWW.
Requirements:
1. Sq (the collection of pages with respect to query q) is small.
2. Sq is rich in relevant pages.
3. Sq contains most of the strongest authorities.
Subgraph(q, E, t, d)
  q: a query string
  E: a text-based search engine
  /* Narrow down: results from AltaVista */
  Let Rq denote the top t results of E on q.
  Set Sq := Rq
  /* Expand */
  For each page p in Rq:
    Add all pages that p points to into Sq.
    Add all pages that point to p into Sq. (If there are more than d such pages, randomly select d of them and add those to Sq.)
    /* Limit: a single page p can bring in at most d pages that point to it; otherwise hundreds of thousands of extra pages could be pulled in. */
  /* Remove intrinsic links (used for website navigation). Anti-collusion: allow at most m pages from a single domain to point to any given page. */
  Return Sq
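
The expansion step of Subgraph(q, E, t, d) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `build_subgraph`, the two dictionaries standing in for a link index, and the toy data are all hypothetical, the root set is passed in directly instead of being fetched from a search engine, and the removal of intrinsic links is omitted.

```python
import random

def build_subgraph(root_set, out_links, in_links, d, rng=None):
    """Expand a root set Rq into the base set Sq.

    root_set  : pages returned by the text search engine (top t results)
    out_links : dict mapping page -> pages it points to
    in_links  : dict mapping page -> pages that point to it
    d         : maximum number of in-linking pages added per root page
    """
    rng = rng or random.Random(0)
    base_set = set(root_set)
    for page in root_set:
        # Add every page that this root page points to.
        base_set.update(out_links.get(page, ()))
        # Add pages pointing to this root page, capped at d random picks.
        pointing = sorted(in_links.get(page, ()))
        if len(pointing) > d:
            pointing = rng.sample(pointing, d)
        base_set.update(pointing)
    return base_set

# Hypothetical toy data standing in for a search engine and a link index.
out_links = {"r1": ["a", "b"], "r2": ["b"]}
in_links = {"r1": ["h1", "h2", "h3", "h4"], "r2": ["h1"]}
base = build_subgraph(["r1", "r2"], out_links, in_links, d=2)
print(sorted(base))
```

Here r1 has four in-linking pages but d = 2, so only two of them are sampled, while r2's single in-link is always kept.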
HITS Algorithm

Step 2: Computing Hubs and Authorities
Rules:
1. A good hub points to many good authorities.
2. A good authority is pointed to by many good hubs.
3. Authorities and hubs have a mutual reinforcement relationship.
Let the authority score of page i be x(i) and the hub score of page i be y(i).
Mutually reinforcing relationship:
I step: $x(i) \leftarrow \sum_{(j,i) \in E} y(j)$
O step: $y(i) \leftarrow \sum_{(i,j) \in E} x(j)$
HITS Algorithm
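[Figure: the I step sums the hub scores of the pages pointing to page 1; the O step sums the authority scores of the pages that page 1 points to.]
x(1) = y(2) + y(3) + y(4)
y(1) = x(5) + x(6) + x(7)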
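
The two example equations can be reproduced with a short Python sketch of the I and O steps. The edge list below is inferred from the equations (pages 2, 3, 4 point to page 1; page 1 points to pages 5, 6, 7), and the starting hub scores of 1.0 are an assumption made for illustration.

```python
def i_step(edges, y):
    """Authority update: x(i) = sum of y(j) over all edges (j, i)."""
    x = {page: 0.0 for page in y}
    for j, i in edges:
        x[i] += y[j]
    return x

def o_step(edges, x):
    """Hub update: y(i) = sum of x(j) over all edges (i, j)."""
    y = {page: 0.0 for page in x}
    for i, j in edges:
        y[i] += x[j]
    return y

# Pages 2, 3, 4 point to page 1; page 1 points to pages 5, 6, 7.
edges = [(2, 1), (3, 1), (4, 1), (1, 5), (1, 6), (1, 7)]
y0 = {page: 1.0 for page in range(1, 8)}  # assumed starting hub scores
x1 = i_step(edges, y0)   # x1[1] = y0[2] + y0[3] + y0[4]
y1 = o_step(edges, x1)   # y1[1] = x1[5] + x1[6] + x1[7]
print(x1[1], y1[1])
```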
HITS Algorithm
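$x(i) \leftarrow \sum_{(j,i) \in E} y(j) \;\Rightarrow\; x \leftarrow A^T y$
$y(i) \leftarrow \sum_{(i,j) \in E} x(j) \;\Rightarrow\; y \leftarrow A x$
$x_k \propto (A^T A)^{k-1} A^T z$ (normalized to a unit vector)
$y_k \propto (A A^T)^{k} z$ (normalized to a unit vector)
where A is the adjacency matrix of subgraph G (Sq for query q)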
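
The equivalence between the alternating updates and the closed form can be checked numerically. The sketch below uses plain Python lists; the 4-page adjacency matrix is hypothetical, and z is assumed to be the all-ones starting vector, which is the convention in Kleinberg's paper.

```python
import math

def transpose(m):
    return [list(row) for row in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

# Hypothetical 4-page subgraph: A[i][j] = 1 when page i links to page j.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
At = transpose(A)
z = [1.0] * len(A)  # assumed starting vector (all ones)

# Iterative form: alternate x = A^T y and y = A x, normalizing each time.
k = 5
y = z
for _ in range(k):
    x = normalize(matvec(At, y))
    y = normalize(matvec(A, x))

# Closed form: x_k is the unit vector along (A^T A)^(k-1) A^T z.
v = matvec(At, z)
for _ in range(k - 1):
    v = matvec(At, matvec(A, v))
x_closed = normalize(v)
print(x, x_closed)
```

Page 2 (zero-based) collects links from three other pages and ends up with the largest authority weight in both computations.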
HITS Algorithm

Recap:

If A is a square matrix, a non-zero vector v is an
eigenvector of A if there is a scalar λ such that
Av = λv
HITS Algorithm

The Iterate(G,k) procedure can be applied to filter out the top c authorities and
top c hubs.
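
A minimal Python sketch of this procedure follows. The function names `iterate` and `filter_top`, the page labels, and the choice of k = 20 rounds are assumptions made for illustration; weights start at 1 and are normalized to unit length after every round.

```python
import math

def iterate(pages, edges, k):
    """Run k rounds of the I and O steps with normalization."""
    x = {p: 1.0 for p in pages}  # authority weights
    y = {p: 1.0 for p in pages}  # hub weights
    for _ in range(k):
        new_x = {p: 0.0 for p in pages}
        for src, dst in edges:           # I step
            new_x[dst] += y[src]
        new_y = {p: 0.0 for p in pages}
        for src, dst in edges:           # O step uses the updated x
            new_y[src] += new_x[dst]
        x_norm = math.sqrt(sum(v * v for v in new_x.values())) or 1.0
        y_norm = math.sqrt(sum(v * v for v in new_y.values())) or 1.0
        x = {p: v / x_norm for p, v in new_x.items()}
        y = {p: v / y_norm for p, v in new_y.items()}
    return x, y

def filter_top(pages, edges, k, c):
    """Return the c pages with the largest authority and hub weights."""
    x, y = iterate(pages, edges, k)
    top_auth = sorted(pages, key=lambda p: x[p], reverse=True)[:c]
    top_hubs = sorted(pages, key=lambda p: y[p], reverse=True)[:c]
    return top_auth, top_hubs

# Hypothetical graph: three hub-like pages linking to three content pages.
pages = ["h1", "h2", "h3", "a1", "a2", "a3"]
edges = [("h1", "a1"), ("h1", "a2"), ("h1", "a3"),
         ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
top_auth, top_hubs = filter_top(pages, edges, k=20, c=2)
print(top_auth, top_hubs)
```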
HITS Results

www.roadahead.com is ranked 123rd by AltaVista.

Text-based search ignores the authorities.

Text-based search + link analysis works, even for authorities that do not contain many occurrences of the query string “Gates”.
Related work

Similar-page queries:
  Instead of “find t pages containing the string q”, use “find t pages pointing to p”.
  E.g. Honda → Ford, Toyota, etc.
Social Network
  Measure of standing by path counting (Katz).
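
The formula for Katz's measure is not reproduced on the slide. As summarized in Kleinberg's paper, the standing of node j is s_j = Σ_i Σ_r b^r P^r_ij, where P^r_ij is the number of paths of length r from i to j and b is a constant small enough for the sum to converge. The Python sketch below truncates the sum at a maximum length; the function name, the 3-node chain, and the values b = 0.5 and max_len = 5 are all illustrative assumptions. Powers of the adjacency matrix are used to count paths, so nodes may repeat along a path in graphs with cycles.

```python
def katz_standing(adj, b, max_len):
    """Truncated Katz standing: s_j = sum over i and r of b^r * P^r_ij."""
    n = len(adj)
    standing = [0.0] * n
    # paths[i][j] holds the number of length-r paths from i to j.
    paths = [row[:] for row in adj]
    weight = b
    for _ in range(max_len):
        for j in range(n):
            standing[j] += weight * sum(paths[i][j] for i in range(n))
        # Extend every path by one more edge: paths = paths * adj.
        paths = [[sum(paths[i][m] * adj[m][j] for m in range(n))
                  for j in range(n)] for i in range(n)]
        weight *= b
    return standing

# Hypothetical 3-node chain 0 -> 1 -> 2.
adj = [[0, 1, 0],
       [0, 0, 1],
       [0, 0, 0]]
s = katz_standing(adj, b=0.5, max_len=5)
print(s)
```

Node 2 is reached by one path of length 1 and one of length 2, so its standing is 0.5 + 0.25, higher than node 1, which is reached only by a single path of length 1.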
Scientific Citations
Multiple Sets of Hubs and Authorities

The same query string can correspond to different meanings.
Highlights of the method

Developed a set of algorithmic tools for extracting information from the link structure of hyperlinked environments.

Formulated the notion of authority based on the relationship between a set of “authority” pages and a set of “hub” pages.
Proposed a heuristic algorithm to find these pages.
Surveyed variants and applications.
Evaluation: HITS vs PageRank

Eigengaps

The eigengap is the difference between the largest and second-largest eigenvalues of the matrix M.
Ng et al. (2001) compared the stability of convergence.
Idea:
The Cora database is a collection containing the citation (similar to link) information from several thousand papers in AI. If an article is truly authoritative or influential, then surely the addition of a few links or a few citations should not make us change our minds about these sites or articles having been very influential.
Based on this idea, Ng et al. constructed a set of five perturbed databases in which 30% of the papers from the base set were randomly deleted.
Evaluation: HITS vs PageRank

[Figure: results on the perturbed databases, one panel for HITS and one for PageRank]
Evaluation: HITS vs PageRank
•The eigenvalues of the matrices are indicated by the directions of
the principal axes of the ellipses.
•Small perturbation cause 45 degree change when eigengap small.
No change when eigengap large.
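
The effect of the eigengap can be illustrated with 2x2 symmetric matrices, for which the direction of the principal eigenvector has a closed form. The matrices and the perturbation size below are invented for illustration and are not taken from Ng et al. Before perturbation both matrices are diagonal, so the principal eigenvector lies at 0 degrees and the returned angle is the rotation caused by the perturbation.

```python
import math

def principal_angle_deg(a, b, c):
    """Angle of the principal eigenvector of the symmetric matrix
    [[a, b], [b, c]], measured from the first axis, in degrees."""
    return math.degrees(0.5 * math.atan2(2 * b, a - c))

eps = 0.01  # small symmetric perturbation added off the diagonal

# Large eigengap: eigenvalues 2 and 1 before perturbation.
large_gap = principal_angle_deg(2.0, eps, 1.0)

# Tiny eigengap: eigenvalues 1 and 1 - 1e-6 before perturbation.
small_gap = principal_angle_deg(1.0, eps, 1.0 - 1e-6)

print(large_gap, small_gap)
```

With the large gap the eigenvector turns by well under one degree; with the tiny gap the same perturbation turns it by almost 45 degrees.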
Evaluation: Pros





Creative idea of formulating the authority concept as “Authority” and “Hub”, especially in 1998.
Efficient heuristic algorithm to solve for the authority weights and hub weights.
Query-driven dynamic ranking.
Solid theoretical background.
Abundant variants and applications.
Evaluation: Cons




The convergence is not as robust as PageRank's when there are perturbations.
Topic drift.
Inefficiency at run time.
User behavior information is not integrated.
Reference



J. Kleinberg. Authoritative sources in a hyperlinked
environment. Proc. 9th ACM-SIAM Symposium on
Discrete Algorithms, 1998. Extended version in
Journal of the ACM 46(1999). Also appears as IBM
Research Report RJ 10076, May 1997.
A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for link analysis. Proc. 24th International Conference on Research and Development in Information Retrieval (SIGIR), New York, NY: ACM Press, 2001.
Wikipedia: www.wikipedia.org
Questions?

Thanks for your time!