Towards Scaling Fully Personalized PageRank
Dániel Fogaras, Balázs Rácz
Budapest University of Technology and Economics
Computer and Automation Research Institute of the Hungarian Academy of Sciences

Problem formulation
  PageRank (Brin, Page, ’98)
    $PV = (1-c) \cdot PV \cdot M + c \cdot r$
    PV: PageRank vector; r: uniform distribution vector; c: constant; M: normalized adjacency matrix
    Overall quality measure of Web pages
    Pre-computation: evaluate PV by power iteration
    Query: order results by PV
  Personalized PageRank (Brin, Page, ’98)
    r: preference vector of a user, query dependent
    PPV(r) := PV, a personalized quality measure of Web pages
    Pre-computation: r is not known. What to compute?
    Query: power iteration. 5 hours/query!!!

Preliminaries
  Linearity: $PPV(\alpha_1 r_1 + \dots + \alpha_k r_k) = \alpha_1 PPV(r_1) + \dots + \alpha_k PPV(r_k)$
  Full personalization
    Linearity
    Pre-compute PPV(r_i) for all pages
    $V^2$ disk space, $V(V+E)$ time, where $V \approx 10^9$, $E \approx 10^{10}$ ???
  Topic-Sensitive PageRank (Haveliwala ’01)
    Linearity
    Pre-compute PPV(r_i) for a topical basis r_1, …, r_k, k ≈ 20
    Query: the user submits a topic as weights $\alpha_1, \dots, \alpha_k$
    The query engine combines the PPV(r_i) vectors
  Scaling Personalized Web Search (Jeh, Widom, ’03)
    Decomposition, linearity
    Pre-compute PPV(r_i) for unit vectors r_1, …, r_k corresponding to k ≈ 10,000 pages
    Query: personalization over the 10,000 pages

Towards full personalization
  Our algorithm
    Monte Carlo simulation, not power iteration
    Pre-compute approximate PPV(r_i) for all unit vectors r_1, …, r_k, where k = number of pages
    Scalability: quasi-linear pre-computation and sub-linear query
  Main points of this presentation
    Outline of the algorithm
    Pre-computation: external-memory, distributed
    Query: used to increase precision
    Error of approximation tends to zero exponentially
    Exact vs. approximate PPV: space lower bounds

Outline of the Algorithm
  Theorem (Jeh, Widom ’03, F ’03)
    A random walk starts from page u
    It takes a uniform step with probability 1-c and stops with probability c
    PPV(u,v) = Pr{ the walk stops at page v }
  Monte Carlo algorithm
    Pre-computation
      From each u simulate N independent random walks
      Database of fingerprints: ending vertices of the walks from all vertices
    Query
      $\widehat{PPV}(u,v) := \#(\text{walks } u \to v) / N$

External memory pre-computation
  Goal: N independent random walks from each vertex
  Input: webgraph with $V \approx 10^9$, $E \approx 10^{10}$; V+E > memory
  Accessing the edges
    Edge scan: stream access
    Edges sorted by source vertices

External memory pre-computation (2)
  Goal: N independent random walks from each vertex
  Simulate all walks together
  Iteration: 1 blink = 1 edge scan
    Sort the path ends
    Merge with the sorted graph
  Each walk stops with probability c
  $E(\#\text{walks}) = (1-c)^k \cdot N \cdot V$ after k iterations

Distributed indexing
  M machines with fast local network connections
  memory < V+E ≤ M∙(memory)
  Parallelize for N∙V walks (figure: M = 3)
    Parts of the graph in RAM
    Remote transfers batched
  Heuristic partition: one site to one machine
    Machine1: www.cnn.com/*, Machine2: www.yahoo.com/*
    Uniform load balance ← ordinary PR distributed equally

Query, increasing precision
  Database of N∙V fingerprints (path endings)
  Query: PPV(u) := empirical distribution from N samples
  Theorem (Jeh, Widom, ’03)
    $PPV(u) = \frac{1-c}{|O(u)|} \sum_{v \in O(u)} PPV(v) + c \cdot r_u$
    O(u) denotes the out-neighbors of u; r_u is the unit vector of u
  Query: PPV(u) := empirical distribution from N∙|O(u)| samples
  Number of fingerprints for a query: F = N∙(db accesses/query)

Error of approximation
  Exact: PPV(u,v)
  Approximate by F fingerprints: $\widehat{PPV}(u,v)$
  Theorem
    If PPV(u,v) > PPV(u,w) holds, then
    $\Pr\{ \widehat{PPV}(u,v) < \widehat{PPV}(u,w) \} < \exp(-0.3 \cdot F \cdot \delta^2)$,
    where δ is the gap PPV(u,v) - PPV(u,w)
  Idea of the proof
    $F \cdot (\widehat{PPV}(u,v) - \widehat{PPV}(u,w)) = \#(u \to v) - \#(u \to w)$
    = sum of F i.i.d. random variables with values in {-1, 0, 1}
    Bernstein’s inequality
  Error of approximation → 0 exponentially as F = (db size/vertex)∙(db accesses/query) → ∞

Exact versus approximate
  Model of computation
    Input: graph G with V vertices
    Pre-compute a database of size D
    Query: respond by accessing only the db
  Exact
    Query: u, v, w
    Decide if PPV(u,v) > PPV(u,w) holds
  Approximate, for fixed ε and δ
    Query: u, v, w
    Decide if PPV(u,v) > PPV(u,w) holds, with error probability ε, when | PPV(u,v) - PPV(u,w) | > δ

Lower bounds for the db size
  For the webgraph, $V \approx 10^9$
  Theorem 1: for the Exact problem a db of size $D = \Omega(V^2)$ is required in the worst case
  Theorem 2: for the Approximate problem $D = \Omega(V)$
  Is it possible to improve the 2nd lower bound?
  Our algorithm uses a db of size D = O(V log V)

Idea of the lower bound proofs
  One-way communication complexity
  Bit-vector probing (BVP)
    Alice has a bit vector. Input: $x = (x_1, x_2, \dots, x_m)$
    Communication: B bits, from Alice to Bob
    Bob has a number. Input: 1 ≤ k ≤ m. Question: $x_k = ?$
    Theorem: B ≥ m for any protocol
  Reduction from Exact-PPV to BVP
    Alice has $x = (x_1, x_2, \dots, x_m)$ and builds a graph G with V vertices, where $V^2 = m$
    Alice pre-computes an Exact PPV database of size D
    Communication: the Exact PPV db, D bits
    Bob has 1 ≤ k ≤ m and picks vertices u, v, w
    Bob answers $x_k = ?$ from the comparison PPV(u,v) ? PPV(u,w)
    Thus $D = B \ge m = V^2$

Summary
  Fully personalized PR
    Monte Carlo method, not power iteration
  Pre-computation
    External-memory, distributed
  Query
    Increase precision by (db accesses/query)
  Error of approximation
    Tends to zero exponentially
  Space lower bounds
    Quadratic for Exact PPR
    Linear for Approximate PPR

Thank you!

Misc
  $N \cdot \widehat{PPV}(u,v) = \#(u \to v) \sim \mathrm{Binom}(N, PPV(u,v))$
  Claim (by Chernoff’s bound):
    $\Pr\{ \widehat{PPV}(u,v) > (1+\delta) \cdot PPV(u,v) \} < \exp(-N \cdot PPV(u,v) \cdot \delta^2 / 4)$
  If for a protocol Pr{right answer} ≥ (1+γ)/2, then B ≥ γ∙m
  Notation: PV PageRank vector, c constant, M normalized adjacency matrix
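
The Monte Carlo algorithm from the "Outline of the Algorithm" and "Query, increasing precision" slides can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the function names, the adjacency-list dictionary `graph`, the default `c=0.15`, and the rule that a walk stops at a vertex without out-links are all assumptions made for this example.

```python
import random
from collections import Counter


def simulate_walk(graph, start, c, rng):
    """Random walk from `start`: stop with probability c, else take a uniform out-link."""
    v = start
    # Assumption: a walk that reaches a vertex with no out-links stops there.
    while rng.random() >= c and graph[v]:
        v = rng.choice(graph[v])
    return v


def precompute_fingerprints(graph, n_walks, c=0.15, seed=0):
    """Fingerprint database: the end vertices of n_walks walks from every vertex."""
    rng = random.Random(seed)
    return {u: [simulate_walk(graph, u, c, rng) for _ in range(n_walks)]
            for u in graph}


def query_ppv(fingerprints, u):
    """Empirical distribution of the N stored walk endings of u."""
    ends = fingerprints[u]
    return {v: k / len(ends) for v, k in Counter(ends).items()}


def query_ppv_refined(graph, fingerprints, u, c=0.15):
    """Use the decomposition PPV(u) = c*r_u + (1-c)/|O(u)| * sum of PPV(v), v in O(u).

    This reads N*|O(u)| fingerprints instead of N, at the cost of |O(u)| lookups.
    """
    out = graph[u]
    if not out:  # consistent with the stopping assumption above
        return {u: 1.0}
    estimate = Counter({u: c})
    for v in out:
        for w, p in query_ppv(fingerprints, v).items():
            estimate[w] += (1 - c) * p / len(out)
    return dict(estimate)


if __name__ == "__main__":
    toy_graph = {0: [1, 2], 1: [2], 2: [0]}  # hypothetical 3-page web graph
    db = precompute_fingerprints(toy_graph, n_walks=1000)
    print(query_ppv(db, 0))
    print(query_ppv_refined(toy_graph, db, 0))
```

The refined query trades more database accesses for more samples, which is the F = N∙(db accesses/query) relationship stated on the slide.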
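
The "External memory pre-computation (2)" slide describes advancing all walks together, one edge scan per iteration, by sorting the path ends and merging them with the source-sorted edge list. The sketch below mimics that access pattern with in-memory lists. It is an illustration under assumptions, not the authors' external-memory code: the function name, the `(source, target)` edge tuples, integer vertex ids, and the handling of vertices without out-links are all hypothetical choices.

```python
import random


def precompute_by_edge_scans(edges, vertices, n_walks, c=0.15, seed=0):
    """Simulate n_walks walks per vertex using sequential passes over `edges`.

    `edges` must be a list of (source, target) pairs sorted by source.
    Returns (fingerprints, number_of_edge_scans).
    """
    rng = random.Random(seed)
    # Every active walk is a pair (current_vertex, start_vertex).
    active = [(u, u) for u in vertices for _ in range(n_walks)]
    finished = []  # (start_vertex, end_vertex) pairs
    scans = 0
    while active:
        # Each walk stops with probability c before taking its next step.
        survivors = []
        for cur, start in active:
            if rng.random() < c:
                finished.append((start, cur))
            else:
                survivors.append((cur, start))
        if not survivors:
            break
        survivors.sort()  # sort the path ends by current vertex
        scans += 1        # one merge below corresponds to one edge scan
        next_active = []
        i = e = 0
        while i < len(survivors):
            cur = survivors[i][0]
            # Advance the edge stream to the out-edges of `cur`.
            while e < len(edges) and edges[e][0] < cur:
                e += 1
            first = e
            while e < len(edges) and edges[e][0] == cur:
                e += 1
            targets = [t for _, t in edges[first:e]]
            # Move every walk currently at `cur` along a uniform out-edge.
            while i < len(survivors) and survivors[i][0] == cur:
                start = survivors[i][1]
                if targets:
                    next_active.append((rng.choice(targets), start))
                else:
                    # Assumption: no out-links means the walk ends here.
                    finished.append((start, cur))
                i += 1
        active = next_active
    fingerprints = {u: [] for u in vertices}
    for start, end in finished:
        fingerprints[start].append(end)
    return fingerprints, scans
```

Because both the walk list and the edge list are consumed in sorted order, each iteration touches the edges strictly sequentially, which is what makes the approach suitable for a graph that does not fit in memory.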
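
To give a feel for the magnitudes behind the pre-computation and error slides, here is a short worked calculation. The values c = 0.15, N = 1000, δ = 0.01 and ε = 0.01 are assumptions chosen for illustration only; they do not appear in the slides. The error bound is read here as $\exp(-0.3 \cdot F \cdot \delta^2)$.

```latex
% Edge scans until the expected number of live walks drops below one,
% using E(#walks) = (1-c)^k N V with V = 10^9 and the assumed c, N:
\[
(1-c)^k \, N \, V < 1
\iff
k > \frac{\ln(N V)}{-\ln(1-c)}
  = \frac{\ln 10^{12}}{-\ln 0.85}
  \approx \frac{27.63}{0.1625}
  \approx 170 .
\]

% Fingerprints per query needed to push the ordering-error bound below
% \varepsilon for a gap \delta:
\[
\exp(-0.3 \, F \, \delta^2) \le \varepsilon
\iff
F \ge \frac{\ln(1/\varepsilon)}{0.3 \, \delta^2}
  = \frac{\ln 100}{0.3 \cdot 10^{-4}}
  \approx 1.5 \times 10^{5} .
\]
```

Under these assumed values, the pre-computation needs on the order of 170 edge scans, and a query that must separate scores differing by 0.01 needs on the order of 150,000 fingerprints, obtained as F = N∙(db accesses/query).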