Presented by: Aries Chan, Cody Lawson, and Michael Dwyer
Internet Statistics
151 million active internet users as of January 2004
76% used a search engine at least once during the month
Average time spent searching was about 40 minutes
Search Engines
Most common means of accessing the Web
Easiest method of organizing and accessing information
Therefore, high-quality, usable search engines are very important
Rank  Provider                       Searches (000)  Share of Searches
-     All Search                     10,812,734      100.0%
1     Google Search                  6,986,580       64.6%
2     Yahoo! Search                  1,726,060       16.0%
3     MSN/Windows Live/Bing Search   1,156,415       10.7%
Basic Idea of a Search Engine
Scanning the Web Graph
Using crawlers to create an index of the information
Ranking
Organizing and ordering information in a usable way
Webpage Ranking Problems
Personalization
http://www.google.com/history/?hl=en
Updating – Keeping order up to date
25% of links are changed in one week
5% “new content” in one week
35% of entire web changed in eleven weeks
Reducing Computation Time
Accelerating PageRank computation
Use compression to fit the Web Graph into main memory
Use a sequence of scans of the Web Graph to compute efficiently in external memory
Reduce computation time through numerical methods
Reduce computation time through numerical methods
Use iterative methods such as the Gauss-Seidel and Jacobi methods (Arasu et al. and Bianchini et al.)
Subtract off estimates of non-principal eigenvectors (Kamvar et al.)
Sort the graph lexicographically by URL, resulting in an approximate block structure (Kamvar et al.)
Split into two problems: dangling nodes and non-dangling nodes (Lee et al.)
Others have been working on ways to update only the ranks of nodes influenced by Web Graph changes
Del Corso, Gulli, and Romani
(authors of the paper)
Numerical optimization of PageRank
View the PageRank computation as a linear system
Transform a dense problem into one which uses a sparse matrix
Treatment of dangling nodes which naturally adapts to the random surfer model
Exploiting web matrix permutations
Increase data locality and reduce number of iterations necessary to solve the problem
Web as a directed graph
The random surfer
v_i = Σ_j p_ji v_j
(the sum of the PageRanks of the nodes linking to i, weighted by the transition probabilities)
Equilibrium distribution
v^T = v^T H   (v is a left eigenvector of H with eigenvalue 1)
Dangling nodes (trap the random surfer)
Impose a random jump to every other page:
B = H + au^T
(a is the dangling-node indicator vector, u the uniform personalization vector)
Cyclic paths in the Web Graph (reducibility)
Brin and Page suggested adding artificial transitions (a low-probability jump to all nodes):
G = αB + (1 − α)eu^T
Since G is just a rank-one modification of H, the power method takes advantage of the sparsity of the matrix H.
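The rank-one structure means one power-method step never needs the dense matrix G. A minimal Python sketch on an invented 4-node graph (the graph, α = 0.85, and the uniform vector u are all illustrative assumptions, not from the paper):

```python
import numpy as np

# Invented toy web graph; node 3 is dangling (no out-links).
# H[i, j] = 1/outdeg(i) if page i links to page j, else 0.
H = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1/3, 1/3, 0.0, 1/3],
    [0.0, 0.0, 0.0, 0.0],   # dangling row
])
n = H.shape[0]
alpha = 0.85                # damping factor
u = np.full(n, 1.0 / n)     # uniform personalization vector
a = (H.sum(axis=1) == 0).astype(float)   # dangling-node indicator

# Power method on G = alpha*(H + a u^T) + (1 - alpha) e u^T.
# Since v^T e = 1, each step is a sparse product with H plus a
# rank-one correction; G itself is never materialized.
v = u.copy()
for _ in range(200):
    v_next = alpha * (v @ H) + (alpha * (v @ a) + 1 - alpha) * u
    if np.abs(v_next - v).sum() < 1e-12:
        v = v_next
        break
    v = v_next

print(np.round(v, 4), round(v.sum(), 4))
```

Note that each iterate stays a probability distribution, so no renormalization is needed inside the loop.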
Expand
v^T G = v^T   (eigenproblem)
G = α(H + au^T) + (1 − α)eu^T   (expansion of the Google matrix)
αv^T(H + au^T) + (1 − α)v^T eu^T = v^T
Restructure
Taking
S = I − αH^T − αua^T
and with v^T e = 1:
Sv = (1 − α)u   (after taking the transpose and rearranging)
Dangling Nodes
Pages with no links to other pages
Pages whose existence is inferred but crawler has not reached
According to a 2001 sample, approximately 76% of Web pages are dangling nodes.
Natural Model
B = H + au^T
From a dangling node, the surfer jumps (with probability 1) to some other node, chosen according to u
Drastic Model
Completely removes dangling nodes
Problems
Dangling nodes themselves are not ranked
Removing nodes creates new dangling nodes
Self-loop Model, e.g.
B = H + F
F_ij = 1 if i = j and i is dangling; 0 otherwise
Still row-stochastic, and similar to the natural model
Problem
Gives unreasonably high rank to the dangling nodes
Natural model: the most "accurate"
Problem
Gives a much denser matrix B
It is at least partially for this reason that the problem is approached as an eigenproblem, to exploit the sparsity of H
Can we have an equally lightweight iterative approach?
Sparse matrix R
R = I − αH^T
PageRank v obtained by solving Ry = u, with v = γy such that ||v||_1 = 1
Why does this work?
Since S = R − αua^T and Sv = (1 − α)u,
(R − αua^T)v = (1 − α)u
Use the Sherman-Morrison formula to calculate the inverse:
(R − αua^T)^{-1} = R^{-1} + (R^{-1}ua^T R^{-1}) / (1/α − a^T R^{-1}u)
Writing y = R^{-1}u, we get
v = (1 − α)(1 + (a^T y) / (1/α − a^T y)) y
so v = γy, with γ = (1 − α)(1 + (a^T y) / (1/α − a^T y))
The vector v is our PageRank vector, and it was obtained using only the sparse matrix R
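A minimal sketch of this sparse-system route on the same kind of invented toy graph (a dense `np.linalg.solve` stands in for a real sparse solver; the graph and α are illustrative assumptions):

```python
import numpy as np

# Invented toy web graph; node 3 is dangling.
H = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1/3, 1/3, 0.0, 1/3],
    [0.0, 0.0, 0.0, 0.0],
])
n = H.shape[0]
alpha = 0.85
u = np.full(n, 1.0 / n)

# R = I - alpha * H^T is sparse whenever H is. Solve R y = u, then
# rescale so that ||v||_1 = 1; by the Sherman-Morrison argument the
# rescaled y is the PageRank vector of the natural model.
R = np.eye(n) - alpha * H.T
y = np.linalg.solve(R, u)
v = y / np.abs(y).sum()
print(np.round(v, 4), round(v.sum(), 4))
```

Since R^{-1} = Σ_k (αH^T)^k has nonnegative entries, y is positive and the rescaling is just division by its sum.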
Use a variety of "cheap" operators to permute the web graph in an organized way, in hopes to:
increase data locality
reduce the number of iterations needed to solve the problem
Explore different iterative methods in order to solve the problem more quickly.
Compare the performance of the different iterative methods on specifically permuted web graphs.
The following operations were chosen based on their “limited impact on the computational cost” (Del Corso et al.)
O - orders nodes by increasing out-degree
Q - orders nodes by decreasing out-degree
X - orders nodes by increasing in-degree
Y - orders nodes by decreasing in-degree
B - orders nodes according to their BFS (Breadth-First Search) order of visit
T - transposes the matrix
Applied on their own, the O, Q, X, Y, and T operators still leave a full matrix.
The B operator conveniently arranges R into a lower block triangular structure, due to the BFS order of visit.
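The B operator's renumbering can be sketched as a plain breadth-first traversal; the adjacency list below is an invented example, not data from the paper:

```python
from collections import deque

def bfs_order(adj, root=0):
    """Return a node permutation in BFS order of visit from `root`."""
    seen = {root}
    order = []
    q = deque([root])
    while q:
        i = q.popleft()
        order.append(i)
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                q.append(j)
    # Nodes unreachable from the root are appended at the end.
    order.extend(i for i in range(len(adj)) if i not in seen)
    return order

# Invented toy adjacency list; node 3 is dangling.
adj = [[2, 3], [0], [1], []]
print(bfs_order(adj))   # -> [0, 2, 3, 1]
```

Renumbering the rows and columns of R by this permutation is what produces the lower block triangular structure that the block methods exploit.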
Combining these operations on R, the following structures of the permuted web matrix are produced.
Visual representations of the permuted web graph.
Power Method
Computes dominant eigenvector
Jacobi Method
Using an initial guess, approximates the solution to the linear system of equations. Each successive iteration uses the previous approximate solution as its next guess until a given degree of convergence is reached.
(*further explained)
Gauss-Seidel Method (Reverse Gauss-Seidel)
Modification of the Jacobi Method, which approximates the solution to each successive equation in the linear system using the values already updated earlier in the same iterative sweep.
(*further explained)
*(http://college.cengage.com/mathematics/larson/elementary_linear/5e/students/ch08-10/chap_10_2.pdf)
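The difference between the two methods is visible in a few lines: Gauss-Seidel reuses components updated within the current sweep, while Jacobi would use only the previous iterate. A sketch on an invented toy system (dense arrays for readability; a real implementation would use sparse storage):

```python
import numpy as np

def gauss_seidel(R, u, iters=200, tol=1e-12):
    """Solve R y = u with Gauss-Seidel sweeps in forward node order."""
    n = len(u)
    y = np.array(u, dtype=float)
    for _ in range(iters):
        delta = 0.0
        for i in range(n):
            # Unlike Jacobi, the components y[0..i-1] just updated in
            # THIS sweep are used immediately for the new y[i].
            s = R[i, :i] @ y[:i] + R[i, i + 1:] @ y[i + 1:]
            yi = (u[i] - s) / R[i, i]
            delta = max(delta, abs(yi - y[i]))
            y[i] = yi
        if delta < tol:
            break
    return y

# Invented toy system: R = I - alpha * H^T for a 2-node cycle.
alpha = 0.85
H = np.array([[0.0, 1.0], [1.0, 0.0]])
R = np.eye(2) - alpha * H.T
u = np.full(2, 0.5)
y = gauss_seidel(R, u)
print(np.round(y, 6))
```

Reverse Gauss-Seidel is the same sweep with `i` running from n−1 down to 0.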
Further exploration led to iterative methods based on the distinct block structure of certain web graph permutations.
DN (or DNR) Method
The permuted matrix R_OT has the property that the lower diagonal block coincides with the identity matrix. The matrix can be easily partitioned into non-dangling and dangling portions; the non-dangling part is then solved by Gauss-Seidel (or Reverse Gauss-Seidel, respectively).
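Assuming the stated block structure (non-dangling nodes first, identity block for dangling nodes), the split reduces to a smaller solve plus one matrix-vector product. The toy blocks below are invented for illustration:

```python
import numpy as np

# Assumed permuted structure:
#   [ A  0 ] [y1]   [u1]      A: non-dangling block of R
#   [ B  I ] [y2] = [u2]      I: identity block for dangling nodes
# Step 1: solve A y1 = u1 (by Gauss-Seidel in the paper; dense here).
# Step 2: y2 = u2 - B y1, a single matrix-vector product.
alpha = 0.85
A = np.array([[1.0, -alpha], [-alpha, 1.0]])   # invented toy block
B = np.array([[-alpha, 0.0]])                  # invented coupling block
u1 = np.full(2, 0.25)
u2 = np.full(1, 0.25)

y1 = np.linalg.solve(A, u1)
y2 = u2 - B @ y1
y = np.concatenate([y1, y2])
print(np.round(y, 6))
```

The payoff is that the expensive iterative solve runs only on the non-dangling block, which is why the dangling split pays off on graphs where most nodes are dangling.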
LB/UB/LBR/UBR Methods
Use the Gauss-Seidel or Reverse Gauss-Seidel methods to solve the individual blocks of the triangular block matrices produced by the B operator.
For both the Power Method and Jacobi Method, the number of Mflops is not dependent on permutations of the web matrix. (“the small differences in numerical data are due to the finite precision” [Del Corso et al.])
The Jacobi Method (applied to the matrix R) is only a slight improvement (about 3%) compared to the Power Method (applied to the matrix G).
The Gauss-Seidel and Reverse Gauss-Seidel Methods reduced Mflops by around 40% and running time by around 45% on the full matrix compared to the Power Method on the full matrix.
In particular, the Reverse Gauss-Seidel Method performed on the permuted matrix R_YTB reduced the number of Mflops by 51% and running time by 82% when compared to the Power Method on the full matrix.
The block methods achieved even better results.
In particular, the best overall reduction in computation time and number of Mflops was achieved by the LBR method on the permuted matrix R_QTB:
58% reduction in terms of Mflops
89% reduction in terms of running time
Objective:
Accelerating Google PageRank by numerical methods
Contribution:
Viewed web matrix as a sparse linear system
Formalized new method for treating dangling nodes
Explored new iterative methods and applied them to web matrix permutations
Achievement:
1/10 of the computation time
Reduced Mflops by over 50%
G.M. Del Corso, A. Gulli, F. Romani,
“Exploiting Web Matrix Permutations to Speedup PageRank Computation”
Nielsen MegaView Search (http://enus.nielsen.com/rankings/insights/rankings/internet)