Stable Algorithms for Link Analysis
Andrew Y. Ng, Alice X. Zheng, Michael I. Jordan
The paper briefly discusses the issue of ranking stability under small perturbations to linkage patterns.
The authors extend this analysis and show how it gives insight into the design of stable link analysis
methods. They also propose two new algorithms and study their performance using citation data and
web hyperlink data.
An important feature of the World Wide Web is its dynamic nature: references can be changed, become
inaccessible, or be missed by a search engine. Hence link analysis needs to be stable under perturbations of
the link structure. Kleinberg's HITS algorithm and Google's PageRank algorithm are eigenvector-based
methods; they essentially compute principal eigenvectors of particular matrices derived from the adjacency
graph to determine "authority". Understanding the robustness of link analysis algorithms therefore involves
analyzing the stability of these eigenvector calculations.
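As a concrete illustration of the eigenvector view, HITS authority scores can be computed by power iteration on S = A^T A. The following minimal Python sketch uses a made-up four-page toy adjacency matrix; it is an illustration, not the paper's implementation:

```python
# Minimal sketch: HITS authority scores as the principal eigenvector of
# S = A^T A, found by power iteration. The 4-page adjacency matrix
# (A[i][j] = 1 iff page i links to page j) is a made-up toy example.
A = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 0]]

def hits_authorities(A, iters=100):
    n = len(A)
    a = [1.0] * n
    for _ in range(iters):
        # h = A a (hub scores), then a = A^T h, i.e. a <- (A^T A) a
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        norm = sum(x * x for x in a) ** 0.5
        a = [x / norm for x in a]
    return a

authorities = hits_authorities(A)
# Page 2, which has the most in-links, receives the highest authority score.
```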
A brief overview of the HITS and PageRank algorithms is given. An example is shown where a small
perturbation to a collection causes a large change in the eigenvectors. An algorithm's stability under small
perturbations is determined by the eigengap of the matrix S whose principal eigenvector is computed
(S = A^T A in the case of HITS), defined as the difference between the largest and the second-largest
eigenvalue. A section describes the stability of the HITS and PageRank algorithms.
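To see why the eigengap matters, consider a symmetric 2x2 example (the numbers here are made up for illustration): when the two eigenvalues are nearly equal, a perturbation of comparable size can rotate the principal eigenvector dramatically. A sketch in Python:

```python
import math

def eig2x2_sym(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], largest first."""
    mean = (a + c) / 2.0
    r = math.hypot((a - c) / 2.0, b)
    return mean + r, mean - r

def principal_eigvec(a, b, c):
    """Unit-length principal eigenvector of [[a, b], [b, c]]."""
    l1, _ = eig2x2_sym(a, b, c)
    if abs(b) > 1e-15:
        vx, vy = l1 - c, b
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Eigenvalues 1.001 and 0.999: an eigengap of only 0.002.
l1, l2 = eig2x2_sym(1.001, 0.0, 0.999)
eigengap = l1 - l2

# The unperturbed principal eigenvector is (1, 0). An off-diagonal
# perturbation of 0.01 (five times the eigengap) rotates it by roughly
# 42 degrees -- a large change from a small perturbation.
vx, vy = principal_eigvec(1.001, 0.01, 0.999)
angle = math.degrees(math.atan2(vy, vx))
```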
Randomized HITS
Consider a random surfer who is able to follow hyperlinks both forwards and backwards. The surfer starts
from a randomly chosen page and visits a new page at every time step. At each step, he tosses a coin with
bias ε; if the coin lands heads, he jumps to a new web page chosen uniformly at random. If the coin lands
tails, he checks whether it is an odd or an even time step: on an odd step he follows a randomly chosen
out-link from the current page, and on an even step he traverses a random in-link of the current page. This
process defines a random walk on web pages. The stationary distribution on odd time steps is defined to be
the authority weights, and the stationary distribution on even time steps is defined to be the hub weights.

The corresponding update equations are

a^(t+1) = ε (1/n) 1 + (1 − ε) A_row^T h^(t)
h^(t+1) = ε (1/n) 1 + (1 − ε) A_col a^(t+1)

where ε is the coin bias, n is the number of pages, 1 is the vector of all ones, A_row is the same as A with
rows normalized to sum to 1, and A_col is A with its columns normalized to sum to 1.
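A minimal sketch of these updates in Python, on a made-up three-page toy graph. The 1/n scaling of the uniform-jump term is assumed here so that the scores stay probability distributions; names and values are illustrative only:

```python
# Sketch of the Randomized HITS updates on a toy 3-page graph.
# A[i][j] = 1 iff page i links to page j; eps is the coin bias.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
eps = 0.2

def randomized_hits(A, eps, iters=200):
    n = len(A)
    row_sums = [sum(row) for row in A]
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    # A with rows (resp. columns) normalized to sum to 1.
    Arow = [[A[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
    Acol = [[A[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    a = [1.0 / n] * n
    h = [1.0 / n] * n
    for _ in range(iters):
        # a^(t+1) = (eps/n) 1 + (1 - eps) Arow^T h^(t)
        a = [eps / n + (1 - eps) * sum(Arow[i][j] * h[i] for i in range(n))
             for j in range(n)]
        # h^(t+1) = (eps/n) 1 + (1 - eps) Acol a^(t+1)
        h = [eps / n + (1 - eps) * sum(Acol[i][j] * a[j] for j in range(n))
             for i in range(n)]
    return a, h

a, h = randomized_hits(A, eps)
# Both a and h remain probability distributions at every step.
```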
Subspace HITS
The idea is that subspaces spanned by a few eigenvectors may be stable even when the individual
eigenvectors are not. If the eigengap between the k-th and the (k+1)-st eigenvalue is large, then the
subspace spanned by the first k eigenvectors will be stable. The eigenvectors are treated as a basis for a
subspace from which authority scores are obtained. The procedure for calculating authority scores, where
f(·) is a non-negative, monotonically increasing function, is:
1. Find the first k eigenvectors x_1, x_2, ..., x_k of S = A^T A (or A A^T for hub weights), and their
corresponding eigenvalues λ_1, ..., λ_k.
2. Let e_j be the j-th basis vector (whose j-th element is 1 and all other elements are 0). Calculate the
authority scores

   a_j = Σ_{i=1}^{k} f(λ_i) (e_j^T x_i)^2.

   (This is the square of the length of the projection of e_j onto the subspace spanned by x_1, x_2, ..., x_k,
   where the projection in the x_i direction is weighted by f(λ_i).)

If we take f(λ) = 1, the authority weight of page j becomes Σ_{i=1}^{k} x_{ij}^2. Thus, the authority
weights depend only on the subspace spanned by the k eigenvectors, not on the eigenvectors themselves.
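The procedure can be sketched in plain Python. The power-iteration-with-deflation eigensolver below is a stand-in for any symmetric eigensolver and is not part of the paper, and the toy adjacency matrix is made up. Note that e_j^T x_i is simply the j-th component of x_i:

```python
import math

def top_k_eig(S, k, iters=500):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration
    with deflation (a stand-in for a proper symmetric eigensolver)."""
    n = len(S)
    S = [row[:] for row in S]
    pairs = []
    for r in range(k):
        # Generic start vector, varied per round so it is not orthogonal
        # to the eigenvector being sought.
        v = [math.sin((r + 1) * (i + 1)) for i in range(n)]
        lam = 0.0
        for _ in range(iters):
            w = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
            lam = norm  # for PSD S, ||S v|| -> largest remaining eigenvalue
        pairs.append((lam, v))
        # Deflate: subtract lam * v v^T so the next eigenpair can emerge.
        S = [[S[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

def subspace_hits(A, k, f=lambda lam: 1.0):
    """Authority scores a_j = sum_i f(lambda_i) * (x_i[j])**2."""
    n = len(A)
    # S = A^T A
    S = [[sum(A[m][i] * A[m][j] for m in range(n)) for j in range(n)]
         for i in range(n)]
    pairs = top_k_eig(S, k)
    return [sum(f(lam) * x[j] ** 2 for lam, x in pairs) for j in range(n)]

# Toy symmetric adjacency matrix. With f = 1 and k = n, every score is 1,
# since the x_i then form an orthonormal basis: the scores depend only on
# the spanned subspace, as the text notes.
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
scores = subspace_hits(A, 3)
```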
Finally, the authors report the empirical performance of the four algorithms and explore the issue of
'diversity' in the results returned by the algorithms, focusing on the setting of web graphs with multiple
connected components.