Searching the Web with Low Space Approximations

Ph.D. Thesis Summary

Tamás Sarlós

Supervisor: András A. Benczúr Ph.D.

Eötvös Loránd University
Faculty of Informatics
Department of Information Sciences

Informatics Ph.D. School
János Demetrovics D.Sc.

Foundations and Methods of Informatics Ph.D. Program
János Demetrovics D.Sc.

Budapest, 2006

1 Introduction

The enormous amount of information available on the World Wide Web makes it extremely difficult to find the pages that satisfy our information needs both in terms of their content and quality. The research of web search methods is far from finished: collecting and handling the gigantic inverted indices containing entries for ten billion or more web pages, and distilling the information that best suits the user's needs, involve several technical difficulties. Ranking, similarity search, personalization, and clustering are all non-trivial tasks arising in the analysis of the content appearing on the Web, and are active areas of academic research and commercial service development worldwide.

One of the key algorithmic challenges lies in the sheer size of these problems. Let us only mention that the similarity matrix is of quadratic size, storing which is clearly infeasible for practical tasks, even on secondary storage. Therefore we need algorithms sublinear in the size of the output.

It is the existence of hyperlinks that allows users to navigate the Web; these hyperlinks also endow the collection of documents with a supplementary graph structure. Graphs lending themselves to examination with similar methods are also present elsewhere; they appear in the call data recorded at telephone companies or in the access logs of web search engines. The study of the Web as a graph is our main research subject, since graph-based methods, combined with and supplementing text-based ones, yield a ranking of far better quality. The best example is Google's PageRank [23] method, which computes the stationary distribution of a large-scale Markov chain derived from the complete web graph. The original static PageRank method can be extended: with the help of a personalizing distribution parameterizing the Markov chain we can create result pages better suited to the user's scope of interest and/or the actual query [15]. At the same time, each personalization vector defines an extremely large eigenvalue problem, so if we want to capitalize on the chief virtue of PageRank, the efficient computation and fast ranking of search results, we necessarily have to limit the possible personalizing inputs or settle for an approximate solution [J1, C2].

An equally influential peer of PageRank is Kleinberg's HITS procedure [20], which is equivalent to the singular value decomposition of a topically focused subgraph derived from the submitted query. This method is less frequent in practice as all calculations must be done online, upon receiving the query, which requires far too many resources if a large number of queries are to be served.

The first common property of all the methods derived during our research is that they can be applied to extremely large data sets, such as graphs or matrices with millions or billions of nodes, since their running time is linear or nearly linear in the size of the problem. Secondly, it is proven that their approximation error can be made arbitrarily small. Thirdly, efficiency is achieved by novel use of low-space synopsis data structures originally invented for data stream computation [22].
2 New results

Claim 1 Scaling fully personalized PageRank

Let us consider the web as a graph G = (V, E): pages form the vertex set V and hyperlinks define the directed edges E between them. Let n denote the number of vertices and m the number of edges. Let d+(v) and d−(v) denote the number of edges leaving and entering v, respectively. In [23] the PageRank vector p = (p(1), . . . , p(n)) is defined as the solution of the following equation:

    p(v) = c · r(v) + (1 − c) · Σ_{u: (u,v) ∈ E} p(u)/d+(u),

where r = (r(1), . . . , r(n)) is the teleportation vector and 0 < c < 1 is the teleportation probability, with a typical value of c ≈ 0.15. If r is uniform, i.e. r(v) = 1/n for all v, then p is the PageRank. For non-uniform r the solution p is called personalized PageRank, denoted by PPR_r. Since PPR_r is linear in r [15, 19], it can be computed as a linear combination of personalizations to single points u, i.e. to vectors r = χ_u consisting of all 0s except for node u, where χ_u(u) = 1.

Motivated by search engine applications, we investigated two-phase algorithms that first compute a compact PPR database, from which the value or top-list queries required to rank search results can be answered quickly, with a low number of accesses. In prior work Fogaras and Rácz proved that for any two-phase algorithm capable of computing an arbitrary PPR_u(v) value exactly, there exist graphs such that the size of the PPR database grows quadratically in the number of vertices [J1]. With our co-authors we gave the following communication complexity [21] based lower bounds for the size of the approximate PPR database [C2]:

Theorem 1. Any two-phase algorithm that computes the value PPR_u(v) with probability 1 − δ and additive error ε, where u ∈ H, v ∈ V, H ⊂ V and the graph has at least |H| + (1 − c)/(8εδ) vertices, builds a database of Ω((1/ε) · log(1/δ) · |H|) bits in the worst case. Furthermore, if the algorithm is able to enumerate the top elements of PPR_u(·) with probability 1 − δ and error ε, and the graph has at least 2|H| + 2(1 − c)/ε vertices, then it builds a database of Ω(((1 − 2δ)/ε) · |H| · log n) bits in the worst case.

Claim 1.1 Optimal approximation algorithms for personalized PageRank [C2]

The only prior result for approximating fully personalized PageRank is the Monte Carlo sampling procedure of Fogaras and Rácz [J1], whose database size grows as 1/ε², in contrast with the lower bound of Theorem 1, which depends on 1/ε only. Our new algorithms [C2] compute a database in time linear in the number of edges, nodes, and 1/ε, such that the size of the database asymptotically matches the lower bounds in terms of the parameters ε, δ, n.

Theorem 2. Combining the dynamic programming method of Jeh and Widom [19] with deterministic rounding, we can build a PPR database of size O((n/ε) · log n) bits in time O(m/(c · ε)), such that the above value and top queries can be answered with error ε. Combining dynamic programming with Count-Min sketches [4], we can build a database of size O(n · ln(1/δ)/(c · ε)) bits in time O(m · ln(1/δ)/(c · ε)) such that the obtained approximate values P̂PR_u(v) satisfy P̂PR_u(v) ≥ PPR_u(v) − (1 + 2/c) · ε and Pr[P̂PR_u(v) > PPR_u(v) + 2ε/c] ≤ δ.
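To make the first phase concrete, here is a minimal Python sketch (our illustration, not the code of [J1, C2]; the adjacency-list representation, the dangling-node rule, and all parameters are assumptions) of the Monte Carlo endpoint sampling that [J1] builds on: random walks started from u terminate with probability c after each step, and the empirical distribution of the walk endpoints converges to PPR_u.

    import random
    from collections import Counter

    def ppr_monte_carlo(graph, u, c=0.15, num_walks=10000):
        """Estimate PPR_u by the endpoint distribution of terminating walks."""
        endpoints = Counter()
        for _ in range(num_walks):
            v = u
            while random.random() > c:        # with probability 1 - c take a step
                out = graph.get(v)
                # Dangling node: restart the walk at u (a modelling choice).
                v = random.choice(out) if out else u
            endpoints[v] += 1
        return {w: count / num_walks for w, count in endpoints.items()}

    # Example on a tiny 4-node graph given by out-neighbor lists.
    g = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
    print(ppr_monte_carlo(g, u=0))

A full PPR database in the above sense would store such (truncated) endpoint statistics for every node u.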
Claim 1.2 Experimental evaluation of personalized PageRank algorithms [J1, C2]

Using the Stanford WebBase [16] graph consisting of 80 million nodes and 700 million edges, we experimentally verified that Monte Carlo sampling computes PPR values precise enough in terms of the accuracy of the induced rankings [J1]. However, in accordance with our theoretical results, dynamic programming combined with rounding [C2] proved to be superior: its accuracy significantly exceeded that of sampling and of the best heuristic approach, while consuming a significantly smaller amount of resources.

Claim 2 Scaling SimRank

Generalizing co-citation, Jeh and Widom [18] define the link-based similarity function SimRank analogously to PageRank as follows. Set Sim^(0)(u1, u2) = χ_{u1}(u2), and consider the provably convergent iterations

    Sim^(k)(v1, v2) = (1 − c) · Σ Sim^(k−1)(u1, u2) / (d−(v1) · d−(v2))   if v1 ≠ v2,
    Sim^(k)(v1, v2) = 1                                                   if v1 = v2,

where the summation is over pairs u1, u2 with (u1, v1), (u2, v2) ∈ E. The intuition behind SimRank is that the similarity of a pair of pages is determined by the average similarity of the pages pointing to them. Observe that the size of the Sim table is quadratic in the number of nodes, hence the storage complexity alone makes the above iterations infeasible. Fogaras and Rácz [9] (i) proved that O(n/ε²) samples are sufficient to approximate SimRank and (ii) empirically measured that similarity functions based on multi-step neighborhoods achieve greater agreement with human judgements than those based on one-step neighborhoods such as co-citation.

Claim 2.1 Approximating SimRank [C2]

In our work [C2] we showed that if 1/2 < c < 1, then with the help of an inclusion-exclusion formula Sim(u, v) can be reduced to the "dot product" of the vectors PPR_u(·) and PPR_v(·) computed over the transposed graph. Furthermore, we proved that analogues of Theorems 1 and 2 also hold for the approximate Ŝim values obtained by substituting the exact PPR values with their approximate P̂PR counterparts during the computation. Thus, in time linear in the number of edges and nodes, our SimRank algorithms build databases of optimal size for value and top queries.
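For illustration, the iteration above can be implemented directly in Python as follows (our sketch, not the code of [18] or [C2]; the in-neighbor-list representation and the choice c = 0.6, within the 1/2 < c < 1 range required by Claim 2.1, are assumptions). The n × n Sim table it maintains is exactly the quadratic cost that the PPR-based reduction avoids.

    import numpy as np

    def simrank(in_nbrs, n, c=0.6, iters=10):
        """Iterate Sim^(k) on an n-node graph given by in-neighbor lists."""
        sim = np.eye(n)                       # Sim^(0)(u1, u2) = chi_{u1}(u2)
        for _ in range(iters):
            new = np.eye(n)                   # the diagonal stays 1
            for v1 in range(n):
                for v2 in range(v1 + 1, n):
                    i1, i2 = in_nbrs[v1], in_nbrs[v2]
                    if i1 and i2:
                        s = sum(sim[u1, u2] for u1 in i1 for u2 in i2)
                        new[v1, v2] = new[v2, v1] = (1 - c) * s / (len(i1) * len(i2))
            sim = new
        return sim

    # Pages 0 and 1 are similar because both are pointed to by page 2.
    print(simrank({0: [2], 1: [2], 2: [0, 3], 3: []}, n=4))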
Claim 3 Fighting link spam with SpamRank

High ranking in a search engine generates a large amount of traffic and hence business opportunities for the owners of a web site. Therefore certain web sites resort to techniques known as spamdexing that provide no added value for the web site's users; their sole purpose is boosting the ranking of the target page in search engine result lists [13]. The manipulation of PageRank is often accomplished by the creation of link farms. Research into methods capable of filtering hyperlink spam is important because link farms mislead link analysis, one of the most robust metrics for measuring the quality of web pages. Without such filtering the quality of search engines severely deteriorates.

Claim 3.1 The SpamRank method [J2, C1]

In our papers [J2, C1] we presented SpamRank, a fully automatic method for recognizing the effects of link farms. We define SpamRank in three steps. First we build a personalized PageRank database with scalable Monte Carlo sampling [J1] or with rounded dynamic programming [C2]. By inverting the PPR database we select the so-called supporters of each page, i.e. the pages that significantly contribute to its PageRank value. Then we identify pages that originate suspicious support by statistical analysis. Finally, we calculate SpamRank by personalizing PageRank on the suspicious pages; thus a high SpamRank value is likely caused by link spam. Our results are validated on a manually annotated German (.de) web graph.

Claim 4 Approximation algorithms for large matrices

The best rank-k approximation A_k of an m × n matrix A in terms of the spectral or Frobenius norm is given by the truncated singular value decomposition (SVD). Applications of the SVD are abundant in theory and practice; according to Hofmann [17], the SVD is "the workhorse of machine learning, data mining, signal processing, and computer vision." Computing the SVD of a dense matrix takes O(min{mn², nm²}) time [12]. While polynomial, this is prohibitively slow for large matrices. The convergence speed of the Lánczos/Arnoldi iterative methods applicable to sparse matrices depends heavily on the ratios of the singular values.

Low-rank approximation of large matrices has been an active area of research since Frieze et al.'s seminal paper [11]. The majority of the published methods are based on approximating matrix products via sampling [6]. The sampling probabilities are non-uniform and depend on the row and column lengths of the matrices to be multiplied. Prior results achieved until 2005 allowed computing, faster than the conventional SVD, a rank-k matrix Â_k such that ‖A − Â_k‖_F ≤ ‖A − A_k‖_F + ε · ‖A‖_F holds. Observe that the error term ε · ‖A‖_F can be arbitrarily large compared to ‖A − A_k‖_F. In 2006 three independent results were published [5, 14, C4] that guarantee an approximation with error ‖A − Â_k‖_F ≤ (1 + ε) · ‖A − A_k‖_F. The algorithm of Har-Peled [14] runs in time Ω(nm(k² + k/ε)). The more efficient procedure of Deshpande and Vempala [5] repeats sampling O(k log k) times in an adaptive manner and, for a matrix A containing M non-zero elements, computes in time O((M · (k/ε + k² log k) + (n + m) · (k/ε + k² log k)²) · log(1/δ)) a rank-k matrix Â_k that is a (1 + ε) approximation with probability 1 − δ.

It is well known that projecting a d-element set of vectors into a random O(log(d)/ε²) dimensional space preserves the lengths of all vectors up to a 1 ± ε factor [1, 3], and hence scalar products are preserved as well. Based on this observation we have the following set of results [C4]:

Claim 4.1 Approximate matrix multiplication

We presented an algorithm for quickly approximating matrix products in the Frobenius norm.

Lemma 3. Let A ∈ R^{m×n}, B ∈ R^{n×p}, and 0 < ε ≤ 1. Let S = ε · R, where R is an ε⁻² × n random matrix whose rows are completely independent and each row consists of four-wise independent, zero-mean ±1 entries. Then E[A SᵀS B] = AB and

    E ‖AB − A SᵀS B‖²_F ≤ 2ε² · ‖A‖²_F · ‖B‖²_F.

The above lemma can be turned into an algorithm that succeeds with arbitrarily high probability. A technical component of independent interest is our randomized method that generalizes Freivalds' technique [10] for approximating the Frobenius norm of arbitrary matrix polynomials. The space and time complexity of our algorithm is equivalent to that of the sampling approach published in [6]; however, contrary to [6], our method extends unchanged to approximating the product of an arbitrary number of matrices, or of implicitly given matrices.
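A minimal numpy sketch of the estimator behind Lemma 3 follows (our illustration; for simplicity R uses fully independent random signs rather than the four-wise independent rows of the lemma, and the dimensions are arbitrary). Note that the computation never forms an n × n intermediate: it multiplies the two small sketched factors.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, p, eps = 200, 1000, 150, 0.25

    A = rng.standard_normal((m, n))
    B = rng.standard_normal((n, p))

    r = int(np.ceil(eps ** -2))                     # number of projection rows
    S = eps * rng.choice([-1.0, 1.0], size=(r, n))  # scaled so E[S^T S] = I_n

    approx = (A @ S.T) @ (S @ B)                    # only m x r and r x p factors
    err = np.linalg.norm(A @ B - approx) / (np.linalg.norm(A) * np.linalg.norm(B))
    print(f"relative Frobenius error: {err:.3f}")   # Lemma 3: E[err^2] <= 2 eps^2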
Claim 4.2 Linear regression

Drineas et al. [8] proved that the coefficient matrix of a least squares fitting problem can be downsampled in such a way that the solution of the reduced problem is arbitrarily close to the minimum of the original problem. Unfortunately, however, it is unknown whether the required non-uniform sampling probabilities can be computed any faster than solving the original problem itself.

We showed that projecting the coefficient matrix with the Fast Johnson-Lindenstrauss Transform [2] gives an efficient approximation algorithm for the least squares problem; a code sketch of this projection appears after Claim 4.3. Our proofs, which differ from those of [8], also yield improved upper bounds on the required size of the reduced problem, both for sampling and for random projections.

Theorem 4. Suppose that A ∈ R^{n×d}, b ∈ R^n, and 0 < ε < 1. Let S be a Θ(d log(d)/ε) × n Johnson-Lindenstrauss matrix and define x̃_opt = (SA)⁺Sb, where C⁺ denotes the generalized inverse of matrix C. Then

    Pr[ ‖b − A x̃_opt‖₂ ≤ (1 + ε) · min_{x ∈ R^d} ‖b − Ax‖₂ ] ≥ 1/3.

If S is a Fast Johnson-Lindenstrauss Transform, then computing x̃_opt takes O(nd log n + d² · (d + log² n) · log(d)/ε) time.

By repeating the above procedure O(log(1/δ)) times the probability of success can be boosted to 1 − δ. Thus, for d = ω(log n) we presented the first approximate linear regression algorithm that undercuts the O(nd²) running time of the exact solution.

Claim 4.3 Relative-error low-rank matrix approximation

Independently of [5, 14], we achieved the following stronger result for relative-error low-rank matrix approximation by combining Theorem 4 with a modification of the proofs of [7].

Theorem 5. Let A ∈ R^{m×n} and let π_{B,k}(A) denote the best rank-k approximation of matrix A from the subspace spanned by the columns of matrix B. If 0 < ε ≤ 1 and S is a Θ(k/ε) × n Johnson-Lindenstrauss matrix with independent ±1 entries, then

    Pr[ ‖A − π_{ASᵀ,k}(A)‖_F ≤ (1 + ε) · ‖A − A_k‖_F ] ≥ 1/2.

Thus in time O((M · k/ε + (n + m) · k²/ε²) · log(1/δ)) and in two passes over the data we can compute a rank-k matrix Â_k such that ‖A − Â_k‖_F ≤ (1 + ε) · ‖A − A_k‖_F holds with probability 1 − δ. Our method is faster in terms of k than that of Deshpande and Vempala [5]. Moreover, it finishes in two passes, while [5] necessitates O(k log k) sequential scans over large inputs residing on external storage.
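The projected regression of Theorem 4 is easy to sketch in numpy (our illustration; a dense random sign matrix stands in for the Fast Johnson-Lindenstrauss Transform, which changes the running time but not the approximation guarantee, and all sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, eps = 20000, 10, 0.5

    A = rng.standard_normal((n, d))
    b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

    r = int(np.ceil(d * np.log(d) / eps))           # Theta(d log(d) / eps) rows
    S = rng.choice([-1.0, 1.0], size=(r, n)) / np.sqrt(r)

    x_tilde = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]  # (SA)^+ S b
    x_opt = np.linalg.lstsq(A, b, rcond=None)[0]            # exact solution

    # The ratio is at most 1 + eps with constant probability.
    print(np.linalg.norm(b - A @ x_tilde) / np.linalg.norm(b - A @ x_opt))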
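Similarly, a sketch of the low-rank scheme behind Theorem 5 (ours; the synthetic near-rank-k test matrix and the constant inside Θ(k/ε) are assumptions):

    import numpy as np

    rng = np.random.default_rng(2)
    m, n, k, eps = 300, 400, 5, 0.5

    # A synthetic matrix that is close to rank k.
    A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
    A += 0.05 * rng.standard_normal((m, n))

    r = int(np.ceil(2 * k / eps))                    # Theta(k / eps) rows
    S = rng.choice([-1.0, 1.0], size=(r, n))

    Q, _ = np.linalg.qr(A @ S.T)                     # basis of span(A S^T)
    P = Q @ (Q.T @ A)                                # project A onto that span
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    A_approx = (U[:, :k] * s[:k]) @ Vt[:k]           # pi_{A S^T, k}(A)

    U2, s2, Vt2 = np.linalg.svd(A, full_matrices=False)
    best = np.linalg.norm(A - (U2[:, :k] * s2[:k]) @ Vt2[:k])
    print(np.linalg.norm(A - A_approx) / best)       # <= 1 + eps w.h.p.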
References

[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
[2] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the 38th ACM Symposium on Theory of Computing (STOC), 2006.
[3] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
[4] G. Cormode and S. Muthukrishnan. An improved data stream summary: The Count-Min sketch and its applications. Journal of Algorithms, 55(1):58–75, April 2005.
[5] A. Deshpande and S. Vempala. Adaptive sampling and fast low-rank matrix approximation. In Proceedings of the 10th International Workshop on Randomization and Computation (RANDOM), 2006.
[6] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36:132–157, 2006.
[7] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Polynomial time algorithm for column-row-based relative-error low-rank matrix approximation. Technical Report 2006-04, DIMACS, 2006.
[8] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for l2 regression and applications. In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1127–1136, 2006.
[9] D. Fogaras and B. Rácz. Scaling link-based similarity search. To appear in IEEE Transactions on Knowledge and Data Engineering; preliminary version appeared in WWW 2005.
[10] R. Freivalds. Probabilistic machines can use less running time. In Proceedings of the IFIP Congress 1977, pages 839–842, 1977.
[11] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low rank approximations. In Proceedings of the 39th IEEE Symposium on Foundations of Computer Science (FOCS), pages 370–378, 1998.
[12] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1983.
[13] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.
[14] S. Har-Peled. Low rank matrix approximation in linear time, 2006. Manuscript.
[15] T. H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 15(4):784–796, July-August 2003.
[16] J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke. WebBase: A repository of web pages. In Proceedings of the 9th World Wide Web Conference (WWW), pages 277–293, 2000.
[17] T. Hofmann. Matrix decomposition techniques in machine learning and information retrieval, 2004. http://www.mpi-inf.mpg.de/conferences/adfocs-04/adfocs04.slides-hofmann.pdf
[18] G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. In Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 538–543, 2002.
[19] G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the 12th World Wide Web Conference (WWW), pages 271–279. ACM Press, 2003.
[20] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[21] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.
[22] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.
[23] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford University, 1998.

Publications

Journal papers

[J1] Dániel Fogaras, Balázs Rácz, Károly Csalogány, and Tamás Sarlós. Towards Scaling Fully Personalized PageRank: Algorithms, Lower Bounds, and Experiments. Internet Mathematics, 2(3):333–358, 2005.
[J2] András A. Benczúr, Károly Csalogány, Tamás Sarlós, and Máté Uher. SpamRank – Fully Automatic Link Spam Detection. To appear in Information Retrieval.

Conference papers

[C1] András A. Benczúr, Károly Csalogány, Tamás Sarlós, and Máté Uher. SpamRank – Fully Automatic Link Spam Detection. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), held in conjunction with WWW2005, 2005.
[C2] Tamás Sarlós, András A. Benczúr, Károly Csalogány, Dániel Fogaras, and Balázs Rácz. To Randomize or Not To Randomize: Space Optimal Summaries for Hyperlink Analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 297–306, 2006. Full version available at http://datamining.sztaki.hu/www/index.pl/publications-en
[C3] András A. Benczúr, Károly Csalogány, and Tamás Sarlós. Link-Based Similarity Search to Fight Web Spam. In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), held in conjunction with SIGIR2006, 2006.
[C4] Tamás Sarlós. Improved Approximation Algorithms for Large Matrices via Random Projections.
To appear in Proceedings of the 47th IEEE Symposium on Foundations of Computer Science (FOCS), 2006.

Conference posters

[P1] András A. Benczúr, Károly Csalogány, Dániel Fogaras, Eszter Friedman, Tamás Sarlós, Máté Uher, and Eszter Windhager. Searching a Small National Domain – a Preliminary Report. In Poster Proceedings of the 12th International World Wide Web Conference (WWW), 2003.
[P2] András A. Benczúr, Károly Csalogány, and Tamás Sarlós. On the Feasibility of Low-rank Approximation for Personalized PageRank. In Poster Proceedings of the 14th International World Wide Web Conference (WWW), pages 972–973, 2005.

Publications in Hungarian

[H1] András A. Benczúr, Károly Csalogány, Dániel Fogaras, Eszter Friedman, Balázs Rácz, Tamás Sarlós, Máté Uher, and Eszter Windhager. Magyar nyelvű tartalom a világhálón (Hungarian Content on the World Wide Web). Információs Társadalom és Trendkutató Központ Kutatási Jelentés, 26:48–55, 2004.
[H2] András A. Benczúr, István Bíró, Károly Csalogány, Balázs Rácz, Tamás Sarlós, and Máté Uher. PageRank és azon túl: Hiperhivatkozások szerepe a keresésben (PageRank and Beyond: The Role of Hyperlinks in Search). To appear in Magyar Tudomány, 2006.