Searching the Web with Low Space Approximations

Ph.D. Thesis Summary
Tamás Sarlós
Supervisor: András A. Benczúr Ph.D.
Eötvös Loránd University
Faculty of Informatics
Department of Information Sciences
Informatics Ph.D. School
János Demetrovics D.Sc.
Foundations and Methods of Informatics Ph.D. Program
János Demetrovics D.Sc.
Budapest, 2006
1 Introduction
The enormous amount of information available on the World Wide Web makes it extremely difficult to find the pages that satisfy our information needs both in terms of content and quality. Research into web search methods is far from finished: collecting and handling the gigantic inverted indices that contain entries for ten billion or more web pages, and distilling the information that best suits the user's needs, involve several technical difficulties.
Ranking, similarity search, personalization, and clustering are all non-trivial tasks arising in the analysis of the content appearing on the Web, and are active areas of academic research and commercial service development worldwide. One of the key algorithmic challenges lies in the sheer size of these problems. To mention just one example, the similarity matrix is of quadratic size, and storing it is clearly infeasible for practical tasks, even on secondary storage. Therefore we need algorithms sublinear in the size of the output.
It is the existence of hyperlinks that allows users to navigate the Web; however, these hyperlinks also endow the collection of documents with a supplementary graph structure. Graphs lending themselves to examination with similar methods are present elsewhere as well: they appear in the call data recorded at telephone companies, in the access logs of web search engines, and so on.
The study of the Web as a graph is our main research subject, since graph-based methods, combined with and supplementing text-based ones, yield rankings of far better quality. The best example is Google's PageRank [23] method, which computes the stationary distribution of a large-scale Markov chain derived from the complete web graph. The original static PageRank method can be extended: with the help of a personalizing distribution parameterizing the Markov chain, we can create result pages better suited to the user's scope of interest and/or the actual query [15]. At the same time, each personalization vector defines an extremely large eigenvalue problem, so if we want to capitalize on the chief virtue of PageRank, the efficient computation and fast ranking of search results, we necessarily have to limit the possible personalizing inputs or settle for an approximate solution [J1, C2].
An equally influential peer of PageRank is Kleinberg's HITS procedure [20], which is equivalent to the singular value decomposition of a topically focused subgraph derived from the submitted query. This method is less frequently used in practice, as all calculations must be done on-line, upon receiving the query, which requires far too many resources if a large number of queries are to be served.
The first common property of all the methods developed in our research is that they can be applied to extremely large data sets, such as graphs or matrices with millions or billions of nodes, since their running time is linear or nearly linear in the size of the problem. Secondly, it is proven that their approximation error can be made arbitrarily small. Thirdly, efficiency is achieved by novel use of low-space synopsis data structures originally invented for data stream computation [22].
2 New results
Claim 1 Scaling fully personalized PageRank
Let us consider the web as a graph G = (V, E): let pages form a vertex set V and hyperlinks define directed edges E between them. Let n denote the number of vertices and m the number of edges. Let d⁺(v) and d⁻(v) denote the number of edges leaving and entering v, respectively.
and d− (v) denote the number of edges leaving and entering v, respectively. In [23] the PageRank vector
p = (p(1), . . . , p(n)) is defined as the solution of the following equation:

p(v) = c · r(v) + (1 − c) · Σ_{u:(u,v)∈E} p(u)/d⁺(u),

where r = (r(1), . . . , r(n)) is the teleportation vector and 0 < c < 1 is the teleportation probability with a typical value of c ≈ 0.15. If r is uniform, i.e. r(v) = 1/n for all v, then p is the PageRank.
For non-uniform r the solution p is called personalized PageRank; we denote it by PPRr. Since PPRr is linear in r [15, 19], it can be computed as a linear combination of personalizations to single nodes u, i.e. to vectors r = χu consisting of all 0s except at node u, where χu(u) = 1.
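As a concrete illustration of the defining equation and of linearity in r, a minimal power-iteration sketch (not the thesis's scalable two-phase algorithms; the function and variable names are ours):

```python
def personalized_pagerank(edges, n, r, c=0.15, iters=100):
    """Iterate p(v) = c*r(v) + (1-c) * sum_{u:(u,v) in E} p(u)/d+(u).

    edges: directed (u, v) pairs; r: teleportation vector of length n.
    Illustrative dense sketch; assumes every node reached by probability
    mass has at least one out-edge.
    """
    out_deg = [0] * n
    for u, _ in edges:
        out_deg[u] += 1
    p = list(r)
    for _ in range(iters):
        nxt = [c * r[v] for v in range(n)]
        for u, v in edges:
            nxt[v] += (1 - c) * p[u] / out_deg[u]
        p = nxt
    return p
```

Since PPRr is linear in r, running the sketch with r = 0.5·χ0 + 0.5·χ1 gives the average of the two single-node PPR vectors, which is exactly the decomposition exploited above.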
Motivated by search engine applications, we investigated two-phase algorithms that first compute a compact PPR database, from which the value or top-list queries required to rank search results can be answered quickly with a small number of accesses.
In prior work, Fogaras and Rácz proved that for any two-phase algorithm capable of computing an arbitrary PPRu(v) value exactly, there exist graphs such that the size of the PPR database grows quadratically in the number of vertices [J1]. With our co-authors we gave the following lower bounds, based on communication complexity [21], for the size of the approximate PPR database [C2]:
Theorem 1 Any two-phase approximate algorithm that computes the value PPRu(v) with probability 1 − δ and additive error ε, where u ∈ H, v ∈ V, H ⊂ V, and the graph has at least |H| + (1 − c)/(8εδ) vertices, builds a database of Ω((1/ε) · log(1/δ) · |H|) bits in the worst case.
Furthermore, if the algorithm is able to enumerate PPRu(·) with probability 1 − δ and error ε, and the graph has at least 2|H| + (1 − c)/ε² vertices, then it builds a database of Ω(((1 − 2δ)/ε²) · |H| · log n) bits in the worst case.
Claim 1.1 Optimal approximation algorithms for personalized PageRank [C2]
The only prior result for approximating fully personalized PageRank is the Monte Carlo sampling procedure of Fogaras and Rácz [J1], whose database size grows as 1/ε², in contrast with the lower bound of Theorem 1, which depends on 1/ε only. Our new algorithms [C2] compute a database in time linear in the number of edges, nodes, and 1/ε, such that the size of the database asymptotically matches the lower bounds in terms of the parameters ε, δ, and n.
Theorem 2 Combining the dynamic programming method of Jeh and Widom [19] with deterministic rounding, we can build a PPR database of size O((n/ε²) · log n) bits in time O(m/(c · ε)), such that the above value and top queries can be answered with error ε.
Combining dynamic programming with Count-Min sketches [4], we can build a database of size O((n/(c · ε)) · ln(1/δ)) bits in time O((m/(c · ε)) · ln(1/δ)), such that the obtained approximate values PPR̂u(v) satisfy PPR̂u(v) ≥ PPRu(v) − ε(1 + 2/c) and Pr[PPR̂u(v) > PPRu(v) + 2ε/c] ≤ δ.
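A toy illustration of the dynamic-programming-with-truncation idea: iterate the Jeh–Widom decomposition PPR_u = c·χu + (1 − c) · (average of PPR_v over the out-neighbours v of u), dropping small entries after each round so every stored vector stays sparse. This is only a sketch of the principle; the actual rounding scheme and error analysis of [C2] are more refined, and the names are ours.

```python
from collections import defaultdict

def rounded_ppr_database(adj, c=0.15, eps=1e-4, iters=30):
    """adj: dict node -> list of out-neighbours.
    Returns a sparse dict-of-dicts approximating PPR_u(.) for every u."""
    ppr = {u: {u: c} for u in adj}                 # round 0: c * chi_u
    for _ in range(iters):
        nxt = {}
        for u, outs in adj.items():
            vec = defaultdict(float)
            vec[u] += c
            if outs:
                w = (1 - c) / len(outs)
                for v in outs:
                    for x, val in ppr[v].items():
                        vec[x] += w * val
            # truncation keeps the database small at the price of error ~eps
            nxt[u] = {x: val for x, val in vec.items() if val >= eps}
        ppr = nxt
    return ppr
```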
Claim 1.2 Experimental evaluation of personalized PageRank algorithms [J1, C2]
Using the Stanford WebBase [16] graph of 80 million nodes and 700 million edges, we experimentally verified that Monte Carlo sampling computes PPR values precise enough for the accuracy of the induced rankings [J1]. However, in accordance with our theoretical results, dynamic programming combined with rounding [C2] proved superior: its accuracy significantly exceeded that of sampling and of the best heuristic approach, while consuming significantly fewer resources.
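The Monte Carlo approach of [J1] referred to above admits a very short sketch: simulate random walks from u that terminate with probability c at each step, and take the empirical distribution of the walk endpoints as the estimate of PPR_u. (In this toy version a dangling node simply ends the walk; the names are ours.)

```python
import random

def monte_carlo_ppr(adj, u, c=0.15, n_samples=20000, seed=0):
    """adj: dict node -> list of out-neighbours. Returns a sparse estimate
    of PPR_u as endpoint frequencies of terminating random walks."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_samples):
        v = u
        # continue the walk with probability 1 - c while out-edges exist
        while rng.random() > c and adj[v]:
            v = rng.choice(adj[v])
        counts[v] = counts.get(v, 0) + 1
    return {v: cnt / n_samples for v, cnt in counts.items()}
```

The precomputation phase of [J1] stores such walk fingerprints for every node; the sketch above only shows the estimator for a single personalization node.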
Claim 2 Scaling SimRank
Generalizing co-citation, Jeh and Widom [18] define the link-based similarity function SimRank, similarly to PageRank, as follows. Set Sim(0)(v1, v2) = χv1(v2), and consider the provably convergent iterations

Sim(k)(v1, v2) = (1 − c) · Σ Sim(k−1)(u1, u2)/(d⁻(v1) · d⁻(v2))  if v1 ≠ v2,
Sim(k)(v1, v2) = 1  if v1 = v2,

where the summation is over u1, u2 : (u1, v1), (u2, v2) ∈ E. The intuition behind SimRank is that the similarity of a pair of pages is determined by the average similarity of the pages pointing to them. Observe that the size of the Sim table is quadratic in the number of nodes, and hence the storage complexity alone makes executing the above iterations infeasible. Fogaras and Rácz [9] (i) proved that O(n/ε²) samples are sufficient to approximate SimRank, and (ii) empirically measured that similarity functions based on multi-step neighborhoods achieve greater agreement with human judgments than those based on one-step neighborhoods such as co-citation.
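For a concrete feel for the definition, the iterations can be evaluated directly on a tiny graph; the quadratic Sim table that this materializes is exactly what becomes infeasible at web scale (names are ours):

```python
def simrank(edges, n, c=0.15, iters=10):
    """Direct SimRank iteration with the (1 - c) damping used above.
    Materializes the full quadratic table: tiny graphs only."""
    in_nb = [[] for _ in range(n)]
    for u, v in edges:
        in_nb[v].append(u)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        nxt = [[0.0] * n for _ in range(n)]
        for v1 in range(n):
            for v2 in range(n):
                if v1 == v2:
                    nxt[v1][v2] = 1.0
                elif in_nb[v1] and in_nb[v2]:
                    s = sum(sim[u1][u2] for u1 in in_nb[v1] for u2 in in_nb[v2])
                    nxt[v1][v2] = (1 - c) * s / (len(in_nb[v1]) * len(in_nb[v2]))
        sim = nxt
    return sim
```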
Claim 2.1 Approximating SimRank [C2]
In our work [C2] we showed that if 1/2 < c < 1, then, with the help of an inclusion-exclusion formula, Sim(u, v) can be reduced to the "dot product" of the vectors PPRu(·) and PPRv(·) computed over the transposed graph. Furthermore, we proved that analogues of Theorems 1 and 2 also hold for the approximate Sim̂ values obtained by substituting the exact PPR values with their approximate PPR̂ counterparts during the computation. Thus, in time linear in the number of edges and nodes, our SimRank algorithms build databases of optimal size for value and top queries.
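The building block of this reduction, the dot product of PPR vectors over the transposed graph, can be sketched directly. The exact inclusion-exclusion formula of [C2] is deliberately not reproduced here, so the value below is the raw dot product only; the dense power iteration and all names are ours.

```python
def ppr_vector(adj, u, c=0.15, iters=50):
    """Dense power iteration for PPR personalized to node u."""
    n = len(adj)
    p = [0.0] * n
    p[u] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        nxt[u] = c
        for x in range(n):
            if adj[x] and p[x]:
                w = (1 - c) * p[x] / len(adj[x])
                for y in adj[x]:
                    nxt[y] += w
        p = nxt
    return p

def ppr_dot(adj, u, v, c=0.15):
    """Dot product of PPR_u(.) and PPR_v(.) over the transposed graph."""
    n = len(adj)
    t = [[] for _ in range(n)]
    for x in range(n):
        for y in adj[x]:
            t[y].append(x)
    pu, pv = ppr_vector(t, u, c), ppr_vector(t, v, c)
    return sum(a * b for a, b in zip(pu, pv))
```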
Claim 3 Fighting link spam with SpamRank
High ranking in a search engine generates a large amount of traffic and hence business opportunities for the owners of a web site. Therefore certain web sites resort to techniques known as spamdexing that provide no added value for the web site's users; their sole purpose is to boost the ranking of the target page in search engine result lists [13]. The manipulation of PageRank is often accomplished by the creation of link farms. Research into methods capable of filtering hyperlink spam is important because link farms mislead link analysis, one of the most robust metrics for measuring the quality of web pages. Without such filtering the quality of search engines severely deteriorates.
Claim 3.1 The SpamRank method [J2, C1]
In our papers [J2, C1] we presented SpamRank, a fully automatic method for recognizing the effects of link farms. We define SpamRank in three steps. First, we build a personalized PageRank database with scalable Monte Carlo sampling [J1] or with rounded dynamic programming [C2]. By inverting the PPR database we select the so-called supporters of each page, i.e. the pages that contribute significantly to its PageRank value. Then, by statistical analysis, we identify pages that originate suspicious support. Finally, we calculate SpamRank by personalizing PageRank on the suspicious pages; thus a high SpamRank value is likely caused by link spam. Our results are validated on a manually annotated German (.de) graph.
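The final step is ordinary personalized PageRank with the personalization vector concentrated on the suspicious pages. In this sketch the statistical supporter analysis of [J2, C1] is replaced by a caller-supplied suspicion vector, so only the last step is shown; the names are ours.

```python
def spamrank(adj, suspicion, c=0.15, iters=50):
    """PageRank personalized on the suspicion scores: pages whose
    PageRank is largely supported by suspicious pages score high.

    adj: list of out-neighbour lists; suspicion: non-negative scores."""
    n = len(adj)
    total = sum(suspicion)
    r = [s / total for s in suspicion]
    p = list(r)
    for _ in range(iters):
        nxt = [c * r[v] for v in range(n)]
        for u in range(n):
            if adj[u] and p[u]:
                w = (1 - c) * p[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += w
        p = nxt
    return p
```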
Claim 4 Approximation algorithms for large matrices
The best rank-k approximation of an m × n matrix A in terms of the spectral or Frobenius norm, denoted Ak, is given by the truncated singular value decomposition (SVD). Applications of the SVD are abundant in theory and practice; according to Hofmann [17], the SVD is "the workhorse of machine learning, data mining, signal processing, and computer vision." Computing the SVD of a dense matrix takes O(min{mn², nm²}) time [12]. While polynomial, this is prohibitively slow for large matrices. The convergence speed of the Lánczos/Arnoldi iterative methods applicable to sparse matrices depends strongly on the ratios of the singular values.
Low-rank approximation of large matrices has been an active area of research since the seminal paper of Frieze et al. [11]. The majority of the published methods are based on approximating matrix products via sampling [6]. The sampling probabilities are non-uniform and depend on the row and column lengths of the matrices to be multiplied. Results achieved up to 2005 allowed computing a rank-k matrix Âk faster than the conventional SVD such that ‖A − Âk‖F ≤ ‖A − Ak‖F + ε‖A‖F holds. Observe that the error term ε‖A‖F can be arbitrarily large compared to ‖A − Ak‖F. In 2006 three independent results were published [5, 14, C4] that guarantee an approximation with error ‖A − Âk‖F ≤ (1 + ε)‖A − Ak‖F.
The algorithm of Har-Peled [14] runs in time Ω(nm(k² + k/ε)). The procedure of Deshpande and Vempala [5], which is more efficient, repeats sampling O(k log k) times in an adaptive manner and, for a matrix A containing M non-zero elements, computes in time O((M(k/ε + k² log k) + (n + m)(k/ε + k² log k)²) · log(1/δ)) a rank-k matrix Âk that is a (1 + ε)-approximation with probability 1 − δ.
It is well known that projecting a set of d vectors into a random O(log(d)/ε²)-dimensional space preserves the lengths of all vectors up to a factor of 1 ± ε [1, 3], and hence the scalar products are preserved as well. Based on this observation we have the following set of results [C4]:
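A quick numerical check of this observation, with a dense ±1 projection matrix; the constant in the target dimension is an illustrative choice of ours.

```python
import math
import random

def jl_project(vectors, eps, seed=42):
    """Project d vectors of R^n into k = O(log(d)/eps^2) dimensions with a
    scaled random ±1 matrix; lengths are preserved up to 1 +/- eps with
    high probability."""
    rng = random.Random(seed)
    d, n = len(vectors), len(vectors[0])
    k = max(1, int(4 * math.log(d) / eps ** 2))  # the constant 4 is illustrative
    R = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(k)]
    s = 1.0 / math.sqrt(k)                       # scaling makes E[|Sv|^2] = |v|^2
    return [[s * sum(R[i][j] * v[j] for j in range(n)) for i in range(k)]
            for v in vectors]
```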
Claim 4.1 Approximate matrix multiplication
We presented an algorithm for quickly approximating matrix products in the Frobenius norm.
Lemma 3 Let A ∈ R^{m×n}, B ∈ R^{n×p}, and 0 < ε ≤ 1. If S = εR, where R is an ε⁻² × n random matrix whose rows are completely independent and each row consists of four-wise independent zero-mean ±1 entries, then

E[A SᵀS B] = AB  and  E‖AB − A SᵀS B‖²F ≤ 2ε² ‖A‖²F ‖B‖²F.
The above lemma can be turned into an algorithm that succeeds with arbitrarily high probability. A technical component of independent interest is our randomized method generalizing Freivalds' technique [10] to approximating the Frobenius norm of arbitrary matrix polynomials. The space and time complexity of our algorithm is equivalent to that of the sampling approach published in [6]; however, contrary to [6], our method extends unchanged to approximating the product of an arbitrary number of matrices, or of implicitly given matrices.
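A sketch of the estimator of Lemma 3, with one simplification: the ±1 entries below are fully independent rather than four-wise independent (four-wise independence is what makes the construction small-space; full independence can only help accuracy). Names and the error check are ours.

```python
import random

def approx_matmul(A, B, eps, seed=7):
    """Estimate AB as (A S^T)(S B) with S = eps * R, where R is an
    eps^-2 x n random ±1 matrix, so that E[S^T S] = I."""
    rng = random.Random(seed)
    m, n, p = len(A), len(B), len(B[0])
    t = max(1, int(round(eps ** -2)))
    R = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(t)]
    AS = [[eps * sum(A[i][j] * R[r][j] for j in range(n)) for r in range(t)]
          for i in range(m)]
    SB = [[eps * sum(R[r][j] * B[j][q] for j in range(n)) for q in range(p)]
          for r in range(t)]
    return [[sum(AS[i][r] * SB[r][q] for r in range(t)) for q in range(p)]
            for i in range(m)]
```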
Claim 4.2 Linear regression
Drineas et al. [8] proved that the coefficient matrix of a least squares fitting problem can be downsampled in such a way that the solution of the reduced problem is arbitrarily close to the minimum of the original problem. Unfortunately, however, it is unknown whether the required non-uniform sampling probabilities can be computed any faster than solving the original problem itself.
We showed that projecting the coefficient matrix with the Fast Johnson-Lindenstrauss Transform [2] gives an efficient approximation algorithm for the least squares problem. Our proofs, which differ from those of [8], also yield improved upper bounds on the required size of the reduced problem, both for sampling and for random projections.
Theorem 4 Suppose that A ∈ R^{n×d}, b ∈ R^n, and 0 < ε < 1. Let S be a Θ(d log(d)/ε) × n Johnson-Lindenstrauss matrix and define x̃opt = (SA)⁺Sb, where C⁺ denotes the generalized inverse of the matrix C. Then

Pr[ ‖b − Ax̃opt‖₂ ≤ (1 + ε) min_{x∈R^d} ‖b − Ax‖₂ ] ≥ 1/3.

If S is a Fast Johnson-Lindenstrauss Transform, then computing x̃opt takes O(nd log n + d²(d + log² n) · log(d)/ε) time.
By repeating the above procedure O(log(1/δ)) times, the probability of success can be boosted to 1 − δ. Thus, for d = ω(log n), we presented the first approximate linear regression algorithm that undercuts the O(nd²) running time of the exact solution.
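The sketch-and-solve pattern behind Theorem 4 in a few lines, with a plain scaled ±1 matrix standing in for the Fast Johnson-Lindenstrauss Transform (the FJLT only makes applying S faster, it does not change the pattern); the small solver and all names are ours.

```python
import math
import random

def solve_small(M, y):
    """Gaussian elimination for a small square system M x = y."""
    n = len(M)
    aug = [row[:] + [y[i]] for i, row in enumerate(M)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(aug[r][i]))
        aug[i], aug[piv] = aug[piv], aug[i]
        for r in range(i + 1, n):
            f = aug[r][i] / aug[i][i]
            for q in range(i, n + 1):
                aug[r][q] -= f * aug[i][q]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def sketched_least_squares(A, b, eps, seed=3):
    """Solve min ||Sb - SAx|| for a Theta(d log d / eps) x n random sketch S."""
    n, d = len(A), len(A[0])
    k = max(d + 1, int(4 * d * math.log(d + 1) / eps))  # constant 4 illustrative
    rng = random.Random(seed)
    s = 1.0 / math.sqrt(k)
    S = [[s * rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(k)]
    SA = [[sum(S[i][j] * A[j][c] for j in range(n)) for c in range(d)] for i in range(k)]
    Sb = [sum(S[i][j] * b[j] for j in range(n)) for i in range(k)]
    # normal equations of the reduced problem: (SA)^T SA x = (SA)^T Sb
    G = [[sum(SA[i][a] * SA[i][c] for i in range(k)) for c in range(d)] for a in range(d)]
    y = [sum(SA[i][a] * Sb[i] for i in range(k)) for a in range(d)]
    return solve_small(G, y)
```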
Claim 4.3 Relative-error low-rank matrix approximation
Independently of [5, 14], we achieved the following stronger result for relative-error low-rank matrix approximation by combining Theorem 4 with a modification of the proofs of [7].
Theorem 5 Let A ∈ R^{m×n} and let πB,k(A) denote the best rank-k approximation of the matrix A in the subspace spanned by the columns of the matrix B. If 0 < ε ≤ 1 and S is a Θ(k/ε) × n Johnson-Lindenstrauss matrix with independent ±1 entries, then

Pr[ ‖A − π_{ASᵀ,k}(A)‖F ≤ (1 + ε) ‖A − Ak‖F ] ≥ 1/2.

Thus in time O((M · k/ε + (n + m) · k²/ε²) · log(1/δ)) and in two passes over the data we can compute a rank-k matrix Âk such that ‖A − Âk‖F ≤ (1 + ε)‖A − Ak‖F holds with probability 1 − δ.
Our method is faster in terms of k than that of Deshpande and Vempala [5]. Moreover, it finishes in two passes, while [5] necessitates O(k log k) sequential scans over large inputs residing on external storage.
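The main step of Theorem 5, projecting A onto the column span of A Sᵀ, can be sketched as follows; the final rank-k truncation within that span (a small SVD) is omitted, so the result is at least as good an approximation as the rank-k projection. Names are ours.

```python
import math
import random

def project_onto_sketch(A, r, seed=1):
    """Form C = A S^T for an r x n random ±1 sketch S, orthonormalise the
    columns of C by Gram-Schmidt, and return the projection Q Q^T A."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    S = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(r)]
    C = [[sum(A[i][j] * S[t][j] for j in range(n)) for t in range(r)] for i in range(m)]
    basis = []
    for t in range(r):                         # Gram-Schmidt on columns of C
        v = [C[i][t] for i in range(m)]
        for q in basis:
            d = sum(v[i] * q[i] for i in range(m))
            v = [v[i] - d * q[i] for i in range(m)]
        nv = math.sqrt(sum(x * x for x in v))
        if nv > 1e-10:
            basis.append([x / nv for x in v])
    QtA = [[sum(q[i] * A[i][j] for i in range(m)) for j in range(n)] for q in basis]
    return [[sum(basis[t][i] * QtA[t][j] for t in range(len(basis))) for j in range(n)]
            for i in range(m)]
```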
References
[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins.
Journal of Computer and System Sciences, 66(4):671–687, 2003.
[2] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the 38th ACM Symposium on Theory of Computing (STOC), 2006.
[3] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
[4] G. Cormode and S. Muthukrishnan. An improved data stream summary: The Count-Min sketch
and its applications. Journal of Algorithms, 55(1):58–75, April 2005.
[5] A. Deshpande and S. Vempala. Adaptive sampling and fast low-rank matrix approximation. In
Proceedings of the 10th International Workshop on Randomization and Computation (RANDOM),
2006.
[6] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36:132–157, 2006.
[7] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Polynomial time algorithm for column-rowbased relative-error low-rank matrix approximation. Technical Report 2006-04, DIMACS, 2006.
[8] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for `2 regression and
applications. In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA),
pages 1127–1136, 2006.
[9] D. Fogaras and B. Rácz. Scaling link-based similarity search, 2005. To appear in IEEE Transactions
on Knowledge and Data Engineering; preliminary version appeared in WWW 2005.
[10] R. Freivalds. Probabilistic machines can use less running time. In Proceedings of the IFIP Congress
1977, pages 839–842, 1977.
[11] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low rank approximations. In Proceedings of the 39th IEEE Symposium on Foundations of Computer Science (FOCS),
pages 370–378, 1998.
[12] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore,
1983.
[13] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International
Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.
[14] S. Har-Peled. Low rank matrix approximation in linear time, 2006. Manuscript.
[15] T. H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search.
IEEE Transactions on Knowledge and Data Engineering, 15(4):784–796, July-August 2003.
[16] J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke. WebBase: A repository of web pages. In
Proceedings of the 9th World Wide Web Conference (WWW), pages 277–293, 2000.
[17] T. Hofmann. Matrix decomposition techniques in machine learning and information retrieval, 2004.
http://www.mpi-inf.mpg.de/conferences/adfocs-04/adfocs04.slides-hofmann.pdf.
[18] G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. In Proceedings of the
8th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages
538–543, 2002.
[19] G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the 12th World Wide Web
Conference (WWW), pages 271–279. ACM Press, 2003.
[20] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–
632, 1999.
[21] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.
[22] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.
[23] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to
the web. Technical Report 1999-66, Stanford University, 1998.
Publications
Journal papers
[J1] Dániel Fogaras, Balázs Rácz, Károly Csalogány, and Tamás Sarlós. Towards Scaling Fully Personalized PageRank: Algorithms, Lower Bounds, and Experiments. Internet Mathematics 2(3):333358, 2005.
[J2] András A. Benczúr, Károly Csalogány, Tamás Sarlós, and Máte Uher. SpamRank – Fully Automatic Link Spam Detection. To appear in Information Retrieval.
Conference papers
[C1] András A. Benczúr, Károly Csalogány, Tamás Sarlós, and Máte Uher. SpamRank – Fully Automatic Link Spam Detection. In Proceedings of the 1st International Workshop on Adversarial
Information Retrieval on the Web (AIRWeb), held in conjunction with WWW2005, 2005.
[C2] Tamás Sarlós, András A. Benczúr, Károly Csalogány, Dániel Fogaras, and Balázs Rácz. To Randomize or Not To Randomize: Space Optimal Summaries for Hyperlink Analysis. In Proceedings
of the 15th International World Wide Web Conference (WWW), pages 297-306, 2006.
Full version available at http://datamining.sztaki.hu/www/index.pl/publications-en.
[C3] András A. Benczúr, Károly Csalogány, and Tamás Sarlós. Link-Based Similarity Search to Fight
Web Spam. In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), held in conjunction with SIGIR2006, 2006.
[C4] Tamás Sarlós.
Improved Approximation Algorithms for Large Matrices via Random Projec-
tions. To appear in Proceedings of the 47th IEEE Symposium on Foundations of Computer Science
(FOCS), 2006.
Conference posters
[P1] András A. Benczúr, Károly Csalogány, Dániel Fogaras, Eszter Friedman, Tamás Sarlós, Máte Uher,
and Eszter Windhager. Searching a Small National Domain – a Preliminary Report. In Poster
Proceedings of the 12th International World Wide Web Conference (WWW), 2003.
[P2] András A. Benczúr, Károly Csalogány, and Tamás Sarlós. On the Feasibility of Low-rank Approximation for Personalized PageRank. In Poster Proceedings of the 14th International World Wide
Web Conference (WWW), pages 972–973, 2005.
Publications in Hungarian
[H1] András A. Benczúr, Károly Csalogány, Dániel Fogaras, Eszter Friedman, Balázs Rácz, Tamás Sarlós, Máte Uher, and Eszter Windhager. Magyar nyelvű tartalom a világhálón (Hungarian Content
on the WWW). Információs Társadalom és Trendkutató Központ Kutatási Jelentés 26:48-55, 2004.
[H2] András A. Benczúr, István Bíró, Károly Csalogány, Balázs Rácz, Tamás Sarlós, and Máte Uher.
PageRank és azon túl: Hiperhivatkozások szerepe a keresésben (PageRank and Beyond: The Role
of Hyperlinks in Search). To appear in Magyar Tudomány, 2006.