Towards Scaling Fully Personalized PageRank

Dániel Fogaras, Balázs Rácz
Budapest University of Technology and Economics
Computer and Automation Research Institute of the Hungarian Academy of Sciences
Problem formulation
PageRank (Brin, Page, ’98)
PV = (1 - c) ∙ PV ∙ M + c ∙ r
PV: PageRank vector; r: uniform distribution vector; M: normalized adjacency matrix; c: constant
Overall quality measure of Web pages
Pre-computation: evaluate PV by power iteration
Query: order results by PV
Personalized PageRank (Brin, Page, ’98)
r: preference vector of a user, query dependent
PPV(r) := PV, the personalized quality measure of Web pages
Pre-computation: r is not known. What to compute?
Query: power-iteration. 5 hours/query!!!
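The power iteration named on this slide can be sketched as follows. This is a minimal illustration, assuming a toy graph given as adjacency lists in which every vertex has at least one out-link; the function name, the value c = 0.15, and the fixed iteration count are assumptions for the example, not taken from the slides.

```python
def personalized_pagerank(out_links, r, c=0.15, iterations=100):
    """Repeat PV <- (1 - c) * PV * M + c * r for a fixed number of rounds."""
    n = len(out_links)
    pv = list(r)
    for _ in range(iterations):
        # Teleport part: c * r
        nxt = [c * r[v] for v in range(n)]
        # Link-following part: (1 - c) * PV * M, M being the normalized adjacency matrix
        for u, targets in enumerate(out_links):
            share = (1 - c) * pv[u] / len(targets)
            for v in targets:
                nxt[v] += share
        pv = nxt
    return pv


# Toy graph (assumption: no vertex without out-links)
graph = [[1, 2], [2], [0]]
pagerank = personalized_pagerank(graph, [1 / 3] * 3)       # r uniform: ordinary PageRank
ppv_0 = personalized_pagerank(graph, [1.0, 0.0, 0.0])      # r concentrated on page 0
```

With r uniform the result is the ordinary PageRank vector; with r concentrated on one page it is that page's personalized vector. Each query-time run costs one pass over all edges per iteration, which is why the slide calls it impractical at web scale.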
Preliminaries
Linearity: PPV(α1∙r1 + … + αk∙rk) = α1∙PPV(r1) + … + αk∙PPV(rk)
Full personalization
Pre-compute PPV(ri) for all pages
V^2 disk space, V∙(V+E) time, where V ≈ 10^9, E ≈ 10^10 ???
Topic-Sensitive PageRank (Haveliwala ’01)
Linearity
Pre-compute PPV(ri) for a topical basis r1,…,rk, k≈20
Query: user submits a topic by weights α1, …, αk
Query engine combines PPV(ri) vectors
Scaling Personalized Web Search (Jeh, Widom, ’03)
Decomposition, linearity
Pre-compute PPV(ri) for unit vectors r1,…,rk, corresponding to k ≈ 10,000 pages
Query: personalization over the 10,000 pages
Towards full personalization
Our algorithm
Monte Carlo simulation, not power iteration
Pre-compute approximate PPV(ri) for all unit vectors r1,…,rk, k = number of pages
Scalability: quasi-linear pre-computation & sub-linear query
Main points of this presentation
Outline of the algorithm
Pre-computation: external-memory, distributed
Query: used to increase precision
Error of approximation tends to zero exponentially
Exact vs. approximated PPV -- space lower bounds
Outline of the Algorithm
Theorem (Jeh, Widom ’03, F ’03)
A random walk starts from page u
At each step it stops with probability c, otherwise (probability 1-c) it moves to a uniformly chosen out-neighbor
PPV(u,v) = Pr{ the walk stops at page v }
Monte Carlo algorithm
Pre-computation
From each vertex u simulate N independent random walks
Database of fingerprints: ending vertices of the walks from all vertices
Query
PPV(u,v) := #(walks u→v) / N, where #(walks u→v) counts the walks from u that ended at v
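A minimal in-memory sketch of the Monte Carlo algorithm on this slide, assuming a toy graph where every vertex has an out-link. The function names, the seed, and c = 0.15 are illustrative assumptions; the real pre-computation is external-memory and distributed.

```python
import random
from collections import Counter


def fingerprint(out_links, u, c, rng):
    """One random walk from u: stop with probability c, else follow a uniform out-link."""
    v = u
    while rng.random() >= c:          # continue with probability 1 - c
        v = rng.choice(out_links[v])
    return v                          # the ending vertex is the fingerprint


def build_database(out_links, n_walks, c=0.15, seed=0):
    """Pre-computation: N fingerprints for every start vertex."""
    rng = random.Random(seed)
    return [[fingerprint(out_links, u, c, rng) for _ in range(n_walks)]
            for u in range(len(out_links))]


def estimate_ppv(database, u):
    """Query: empirical distribution of the fingerprints stored for u."""
    counts = Counter(database[u])
    n = len(database[u])
    return {v: k / n for v, k in counts.items()}


graph = [[1, 2], [2], [0]]
database = build_database(graph, n_walks=5000)
ppv_0 = estimate_ppv(database, 0)
```

The query touches only the N stored fingerprints of u, independent of the graph size.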
External memory pre-computation
Goal: N independent random walks from
each vertex
Input: webgraph V ≈ 10^9, E ≈ 10^10
V+E > memory
Accessing the edges
Edge scan --- stream access
Edges sorted by source vertices
External memory pre-computation (2)
Goal: N independent random walks from
each vertex
Simulate all walks together
Iteration: 1 blink = 1 edge scan
Sort path ends
Merge with the sorted graph
Each walk stops with prob. c
E( #walks ) = (1-c)^k∙N∙V after k iterations
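The sort-and-merge iteration can be sketched in memory as below. Python lists stand in for the on-disk files, the edge list is assumed to be sorted by source vertex, and every vertex is assumed to have at least one out-edge; all names are hypothetical.

```python
import random
from itertools import groupby
from operator import itemgetter


def simulate_all_walks(sorted_edges, num_vertices, n_walks, c=0.15, seed=0):
    """Advance all walks together: one pass over the edge stream per iteration."""
    rng = random.Random(seed)
    # Each active walk is a (current end vertex, start vertex) pair
    paths = [(u, u) for u in range(num_vertices) for _ in range(n_walks)]
    fingerprints = [[] for _ in range(num_vertices)]
    while paths:
        survivors = []
        for end, start in paths:
            if rng.random() < c:                 # walk stops with probability c
                fingerprints[start].append(end)
            else:
                survivors.append((end, start))
        survivors.sort()                         # sort path ends
        paths = []
        i = 0
        # Merge the sorted path ends with the edge stream sorted by source
        for src, group in groupby(sorted_edges, key=itemgetter(0)):
            targets = [dst for _, dst in group]
            while i < len(survivors) and survivors[i][0] == src:
                paths.append((rng.choice(targets), survivors[i][1]))
                i += 1
    return fingerprints


edges = [(0, 1), (0, 2), (1, 2), (2, 0)]         # sorted by source vertex
db = simulate_all_walks(edges, num_vertices=3, n_walks=4000)
```

Because both inputs are sorted by vertex, each iteration reads the edges sequentially once, which matches the stream-access constraint from the previous slide.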
Distributed indexing
M machines with fast local network connections
memory < V+E ≤ M∙(memory)
Parallelize for N∙V walks
M=3
Parts of the graph in RAM
Remote transfers batched
Heuristic partition: one site to one machine
Machine1: www.cnn.com/*, Machine2: www.yahoo.com/*
Uniform load balance ← ordinary PR distributed equally
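One way to realize the "one site to one machine" heuristic is to assign pages by a hash of the host name. This is only a sketch: the hash-based assignment and the function name are assumptions, and the slides do not say how sites are mapped to machines.

```python
import zlib
from urllib.parse import urlsplit


def machine_for(url, num_machines):
    """Map every page of a site to the same machine via a hash of its host name."""
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_machines


# All pages of one site land on the same machine, so most links stay local
a = machine_for("http://www.cnn.com/world", 3)
b = machine_for("http://www.cnn.com/sports", 3)
```

Since most links point within the same site, a walk usually stays on one machine and only cross-site steps need the batched remote transfers.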
Query, increasing precision
Database of N∙V fingerprints (path endings)
Query: PPV(u) : = empirical distribution
from N samples
Theorem (Jeh, Widom, ’03)
PPV(u) = (1 - c) ∙ (1/|O(u)|) ∙ Σ_{v ∈ O(u)} PPV(v) + c ∙ r_u
r_u is the unit preference vector of page u
O(u) denotes out-neighbors of u
Query: PPV(u) : = empirical distribution
from N∙|O(u)| samples
Number of fingerprints for a query
F = N∙(db accesses/query)
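The precision-increasing query can be sketched as follows: rather than using only the N fingerprints stored for u, it applies the decomposition theorem once and averages the fingerprints of all out-neighbors of u. The toy graph, the names, and c = 0.15 are assumptions for illustration.

```python
import random
from collections import defaultdict


def build_database(out_links, n_walks, c, seed=0):
    """N fingerprints (walk end vertices) per start vertex."""
    rng = random.Random(seed)
    database = []
    for u in range(len(out_links)):
        ends = []
        for _ in range(n_walks):
            v = u
            while rng.random() >= c:
                v = rng.choice(out_links[v])
            ends.append(v)
        database.append(ends)
    return database


def query_with_neighbors(out_links, database, u, c):
    """PPV(u) ~ c * r_u + (1 - c) / |O(u)| * sum of empirical PPV(v), v in O(u)."""
    estimate = defaultdict(float)
    estimate[u] += c                               # the c * r_u term
    for v in out_links[u]:                         # one db access per out-neighbor
        weight = (1 - c) / (len(out_links[u]) * len(database[v]))
        for end in database[v]:
            estimate[end] += weight
    return dict(estimate)


graph = [[1, 2], [2], [0]]
database = build_database(graph, n_walks=3000, c=0.15)
ppv_0 = query_with_neighbors(graph, database, 0, c=0.15)
```

The query now draws on N∙|O(u)| fingerprints at the cost of |O(u)| database accesses, without enlarging the stored database.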
Error of approximation
Exact: PPV(u,v)
Approximate by F fingerprints: PPV̂(u,v)
Theorem
If PPV(u,v) - PPV(u,w) > δ holds, then
Pr{ PPV̂(u,v) < PPV̂(u,w) } < exp(-0.3∙F∙δ^2)
Idea of the proof
F∙( PPV̂(u,v) - PPV̂(u,w) ) = #(u→v) - #(u→w)
= sum of F i.i.d. random variables with values in {-1, 0, 1}
Bernstein’s inequality
Error of approximation → 0 exponentially with
F = (db size/vertex)∙(db accesses/query) → ∞
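To get a feel for the exponential decay, the bound exp(-0.3∙F∙δ^2) can be evaluated directly and inverted to find how many fingerprints a target error probability requires. The helper names below are hypothetical.

```python
import math


def misorder_bound(num_fingerprints, delta):
    """Upper bound on Pr{v and w are swapped} when their PPV values differ by more than delta."""
    return math.exp(-0.3 * num_fingerprints * delta ** 2)


def fingerprints_needed(epsilon, delta):
    """Smallest F for which the bound drops to epsilon or below."""
    return math.ceil(math.log(1 / epsilon) / (0.3 * delta ** 2))


f = fingerprints_needed(epsilon=0.01, delta=0.1)
bound = misorder_bound(f, delta=0.1)
```

The required F grows only logarithmically in 1/ε but quadratically in 1/δ, so separating pages with very close PPV values is the expensive case.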
Exact versus approximate
Model of computation
Input: G graph with V vertices
Pre-compute a database of size D
Query: respond by accessing only the db.
Exact
Query: u,v,w
Decide if PPV(u,v) > PPV(u,w) holds
Approximate for fixed ε and δ
Query: u,v,w
Decide if PPV(u,v) > PPV(u,w) holds with error
probability ε when | PPV(u,v) - PPV(u,w) | > δ
Lower bounds for the db size
For the webgraph V ≈ 10^9
Theorem 1
For the Exact problem a db of size D = Ω(V^2) is required in the worst case
Theorem 2
For the Approximate problem D = Ω(V)
Is it possible to improve the 2nd lower bound?
Our algorithm uses a D = O(V logV) sized db
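A rough order-of-magnitude comparison of the two growth rates for V ≈ 10^9. The constants hidden by the Ω and O notation are ignored and the logarithm base is an assumption, so the figures indicate scale only.

```python
import math

V = 10 ** 9

exact_size = V ** 2                  # Ω(V^2): about 10^18 for the exact problem
approx_size = V * math.log2(V)       # O(V log V): about 3 * 10^10 for our algorithm

ratio = exact_size / approx_size     # how many times larger the exact database is
```

Under these assumptions the exact database is larger by a factor of roughly V / log V, tens of millions for the webgraph.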
Idea of the lower bound proofs
One-way communication complexity
Bit-vector probing (BVP)
Alice has a bit vector, input: x = (x1, x2, …, xm)
Communication: Alice sends B bits to Bob
Bob has a number, input: 1 ≤ k ≤ m
Bob must output xk
Theorem: B ≥ m for any protocol
Reduction: solving BVP with an Exact-PPV database
Alice has x = (x1, x2, …, xm): she encodes x in a graph G with V vertices, where V^2 = m, and pre-computes an Exact PPV database of size D
Communication: the Exact PPV db, D bits
Bob has 1 ≤ k ≤ m: he maps k to vertices u, v, w, compares PPV(u,v) ? PPV(u,w) using the db, and reads off xk
Thus D = B ≥ m = V^2
Summary
Fully personalized PR
Monte-Carlo method, not power iteration
Pre-computation
External-memory, distributed
Query
Increase precision by (db accesses/query)
Error of approximation
Tends to zero exponentially
Space lower bounds
Quadratic for Exact PPR
Linear for Approximate PPR
Thank you!
Misc
N∙PPV̂(u,v) = #(u→v) ~ Binom(N, PPV(u,v)), where PPV̂ is the estimate from N fingerprints
Claim (by Chernoff’s bound):
Pr{ PPV̂(u,v) > (1+δ)∙PPV(u,v) } < exp(-N∙PPV(u,v)∙δ^2/4)
If for a protocol Pr{right answer} ≥ (1+γ)/2, then B ≥ γ∙m
PV: PageRank vector; c: constant; M: normalized adjacency matrix