The PageRank Citation Ranking: Bringing Order to the Web

Presented By:
- Chandrika B N
Agenda
• Technology Overview
• Introduction
• Link Structure of the Web
• Simplified PageRank
• Eigenvalue and Eigenvector
• PageRank Definition
• Random Surfer Model
• Dangling Links
• PageRank Implementation
• Convergence
• Searching with PageRank
• Personalized PageRank
• Applications
• Conclusion
Technology Overview
• Google recognized the need for a new kind of server setup
• Linked PCs to quickly find each query's answers
This resulted in:
– Faster response time
– Greater scalability
– Lower costs
• Google uses more than 200 signals (including the PageRank algorithm) to determine which pages are important
• Google then performs hypertext matching
- Google Corporate Information
Life of a Google Query
- Google Corporate Information
The mechanism
•Web Crawler: Finds and retrieves pages on the web
•Repository: web pages are compressed and stored here
•Indexer: each index entry has a list of documents in which the term appears
and the location within the text where it occurs
Introduction
• The WWW is very large and heterogeneous
• Web pages are extremely diverse
Problem:
How can the most relevant pages be ranked at the top?
Answer:
Take advantage of the link structure of the Web to produce a ranking of
every web page, known as PageRank
Link Structure of the Web
A and B are Backlinks of C
•Every page has some number of
forward links (outedges) and backlinks
(inedges)
•We can never know all the backlinks
of a page, but we know all of its
forward links
•Generally, highly linked pages are
more “important”
PageRank
• PageRank is a method for computing a ranking for every web page based on the
graph of the web
• A page has high rank if the sum of the ranks of its backlinks is high. This covers both cases:
– The page has many backlinks
– The page has a few highly ranked backlinks
• A page is important if important pages refer to it
• PageRank is a link analysis algorithm that assigns a numerical weight that
represents how important a page is on the web
• The web is democratic, i.e., pages vote for pages
Google interprets a link from page A to page B as a vote, by page A, for page B.
It also analyses the page that cast the vote.
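Simple Ranking Function:
$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
u: web page
B_u: set of pages that link to u (backlinks)
F_u: set of pages that u links to (forward links)
N_u = |F_u|: number of links from u
c: factor used for normalization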
Simplified PageRank Calculation
The PageRanks form a probability distribution over web pages, so the sum
of all web pages’ PageRanks will be one
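The simple ranking function can be sketched as an iteration. This is a minimal illustration, assuming c = 1 and a hypothetical three-page graph in which every page has at least one forward link; the names `links` and `simplified_pagerank` are not from the paper.

```python
def simplified_pagerank(links, iterations=50):
    """Iterate R(u) = sum over backlinks v of R(v) / N_v (taking c = 1)."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, outlinks in links.items():
            share = rank[v] / len(outlinks)  # v splits its rank evenly over its N_v forward links
            for u in outlinks:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(links)
```

Because each page hands out exactly the rank it holds, the total stays at one, which is why the ranks can be read as a probability distribution.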
Eigenvalue and Eigenvector
• Eigenvalues and eigenvectors are properties of a matrix
• In general, a matrix acts on a vector by changing both its magnitude
and direction
• However, a matrix may act on certain vectors by changing only their
magnitude and leaving their direction unchanged; such a vector is an eigenvector
• A matrix acts on an eigenvector by multiplying its magnitude by a
factor called the eigenvalue
Given a linear transformation A, a non-zero vector x is defined to be an
eigenvector of the transformation if it satisfies the eigenvalue equation
$$A x = \lambda x$$
In this situation, the scalar λ is called an eigenvalue of A corresponding to the
eigenvector x
• Given a square matrix A, the eigenvalue equation can be expressed as
$$A x - \lambda I x = 0$$
• The eigenvector equation for A can be written as
$$(A - \lambda I)\, x = 0$$
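Example
$$A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$
With λ as the eigenvalue, the characteristic equation is
$$\det(A - \lambda I) = (2 - \lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = 0$$
Solving this equation we get λ = 1 and λ = 3
• Considering first the eigenvalue λ = 3, we have
$$\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = 3 \begin{pmatrix} x \\ y \end{pmatrix}$$
After matrix multiplication
$$\begin{pmatrix} 2x + y \\ x + 2y \end{pmatrix} = \begin{pmatrix} 3x \\ 3y \end{pmatrix}$$
This can be represented as 2 linear equations:
2x + y = 3x and x + 2y = 3y
The equations can be reduced to x = y
We can choose any non-zero value for x. Taking x = 1, we get y = 1
Eigenvector with eigenvalue 3: $(1, 1)^T$
Eigenvector with eigenvalue 1: $(1, -1)^T$ (from 2x + y = x, which reduces to y = -x)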
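The worked example can be checked numerically. This is a small sketch using only the standard library; the helper names `eigenvalues_2x2` and `mat_vec` are illustrative, and the matrix is the one implied by the two linear equations in the example.

```python
import math


def eigenvalues_2x2(a, b, c, d):
    """Real eigenvalues of [[a, b], [c, d]] from the characteristic equation.

    det(A - tI) = t^2 - (a + d) t + (a d - b c) = 0
    """
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return (trace - root) / 2, (trace + root) / 2


def mat_vec(matrix, vector):
    """Multiply a 2x2 matrix by a 2-vector."""
    (a, b), (c, d) = matrix
    x, y = vector
    return (a * x + b * y, c * x + d * y)


A = ((2, 1), (1, 2))
low, high = eigenvalues_2x2(2, 1, 1, 2)

# A only rescales its eigenvectors: (1, 1) is scaled by 3, (1, -1) by 1
scaled_by_3 = mat_vec(A, (1, 1))
scaled_by_1 = mat_vec(A, (1, -1))
```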
Computing PageRank given a Directed Graph
The Transition matrix A =
We get the eigenvalue λ = 1
Calculating the eigenvector
On substituting
we get,
so the vector u is of the form
Choose v to be the unique eigenvector with the sum of all entries equal to 1
PageRank vector
Calculating the PageRank
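Finding the Eigenvalue and Eigenvector
Let $A_{u,v} = 1/N_u$ if there is an edge from u to v, and $A_{u,v} = 0$ otherwise
If R is a vector over the web pages, then the simple ranking function becomes
$R = cAR$
where
R: eigenvector of A
c: called the eigenvalue in the paper; strictly, $AR = (1/c)R$, so the eigenvalue is 1/c
Note: with this indexing the sum over backlinks is $A^T R$ for a column vector R, so $R = cAR$ holds when A is read with rows as link targets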
Problem: Rank Sink
•Consider two web pages that point to each other but to
no other page
•Suppose there is some web page which points to one of
them, then
•During iteration, this loop will accumulate rank but
will never distribute any rank
•This forms a trap called the RANK SINK. This can be
overcome by introducing a Rank Source
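A rank sink can be reproduced on a toy graph. This is a sketch under assumptions: the four-page graph is hypothetical, c is taken as 1, and `step` and `loop_mass` are illustrative names. Pages Y and Z point only to each other, and Q points into that loop.

```python
def step(links, rank):
    """One pass of the simplified ranking function with c = 1."""
    new_rank = {page: 0.0 for page in links}
    for v, outlinks in links.items():
        for u in outlinks:
            new_rank[u] += rank[v] / len(outlinks)
    return new_rank


# Hypothetical graph: P and Q link to each other, Q also links into the loop Y <-> Z
links = {"P": ["Q"], "Q": ["P", "Y"], "Y": ["Z"], "Z": ["Y"]}
rank = {page: 0.25 for page in links}

loop_mass = []  # rank held by the sink {Y, Z} after each iteration
for _ in range(100):
    rank = step(links, rank)
    loop_mass.append(rank["Y"] + rank["Z"])
```

In this run the rank held by Y and Z never decreases, and P and Q are drained toward zero, which is the behaviour a rank source is meant to counter.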
PageRank Definition:
Let E(u) be some vector over the Web pages that corresponds to a source
of rank. Then, the PageRank of a set of Web pages is an assignment, R', to
the Web pages which satisfies
$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$
such that c is maximized and $\|R'\|_1 = 1$ ($\|R'\|_1$ denotes the L1 norm of R').
where
R'(u): PageRank of document u
c: normalization factor
R'(v): PageRank of document v that links to u
N_v: number of outlinks from document v
E(u): vector of web pages that the surfer randomly jumps to
Computing PageRank
$R_0 \leftarrow S$   (S: any vector over the web pages)
Loop:
$R_{i+1} \leftarrow A R_i$   (calculate the $R_{i+1}$ vector using $R_i$)
$d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$   (calculate the normalizing factor)
$R_{i+1} \leftarrow R_{i+1} + dE$   (find the vector $R_{i+1}$ using d)
$\delta \leftarrow \|R_{i+1} - R_i\|_1$   (find the norm of the difference of the 2 vectors)
while $\delta > \epsilon$   (loop until convergence)
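The loop above can be sketched in Python as follows. This is a literal reading of the slide rather than a production implementation: S is assumed uniform, E is assumed uniform with entries summing to one, and the graph, `pagerank`, `epsilon` and `max_iterations` are illustrative. In this reading d is non-zero only when rank leaks out of A R, which here happens through page D, a page with no forward links.

```python
def pagerank(links, E, epsilon=1e-10, max_iterations=1000):
    """Iterate R <- A R + d E, where d restores the L1 norm lost in A R."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}           # R_0 <- S (uniform here)
    for _ in range(max_iterations):
        new_rank = {p: 0.0 for p in pages}
        for v in pages:                                    # R_{i+1} <- A R_i
            for u in links[v]:
                new_rank[u] += rank[v] / len(links[v])
        d = sum(rank.values()) - sum(new_rank.values())   # d <- ||R_i||_1 - ||R_{i+1}||_1
        for p in pages:                                    # R_{i+1} <- R_{i+1} + d E
            new_rank[p] += d * E[p]
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta <= epsilon:                               # keep looping while delta > epsilon
            break
    return rank


# Hypothetical graph; D has no forward links, so its rank is redistributed through E
links = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A"], "D": []}
E = {p: 0.25 for p in links}
ranks = pagerank(links, E)
```

All ranks are non-negative, so the L1 norms reduce to plain sums.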
Random Surfer Model
• The "random surfer" simply keeps clicking on successive links at
random
• A real web surfer is unlikely to continue in a loop forever
• The surfer periodically "gets bored" and jumps to another
random page
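The random surfer can be simulated directly. This is a sketch under assumptions: the graph is hypothetical, the `boredom` probability of 0.15 and the fixed `seed` are illustrative choices, and jumps are uniform over all pages.

```python
import random


def random_surfer(links, steps, boredom=0.15, seed=0):
    """Estimate visit frequencies of a surfer who follows random links and sometimes jumps."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        outlinks = links[current]
        if not outlinks or rng.random() < boredom:
            current = rng.choice(pages)      # bored (or stuck): jump to a random page
        else:
            current = rng.choice(outlinks)   # otherwise click a random forward link
    return {p: count / steps for p, count in visits.items()}


# Hypothetical graph containing the loop Y <-> Z that would otherwise trap the surfer
links = {"P": ["Q"], "Q": ["P", "Y"], "Y": ["Z"], "Z": ["Y"]}
freq = random_surfer(links, steps=20000)
```

The visit frequencies approximate a ranking in which the loop still scores highly but pages outside it keep a non-zero share, because the surfer periodically escapes.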
Dangling Links
• Links that point to any page with no outgoing links
• They do not affect the ranking of any other page directly
Problem: It is not clear where their weight should be distributed
Solution: They can be removed from the system until all the
PageRanks are calculated
PageRank Implementation
• Convert each URL into a unique integer ID
• Sort the link structure by ID
• Remove the dangling links
• Make an initial assignment of ranks
• Iteratively compute PageRank until convergence
• Add the dangling links back
• Recompute the rankings
NOTE: After adding the dangling links back, we need to iterate as many
times as was required to remove the dangling links
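The first three steps can be sketched as follows. This is an illustration only: the edge list uses hypothetical `.example` URLs, and `assign_ids` and `remove_dangling` are made-up helper names. The number of removal passes is returned because the note above ties the final recomputation to it.

```python
def assign_ids(edges):
    """Map each URL to a unique integer ID and return the edge list sorted by ID."""
    ids = {}
    for source, target in edges:
        for url in (source, target):
            ids.setdefault(url, len(ids))  # first time seen: next free integer
    return ids, sorted((ids[s], ids[t]) for s, t in edges)


def remove_dangling(id_edges):
    """Repeatedly drop links to pages with no forward links; return kept edges and pass count."""
    edges = list(id_edges)
    passes = 0
    while True:
        sources = {s for s, _ in edges}                    # pages that still have forward links
        kept = [(s, t) for s, t in edges if t in sources]
        if len(kept) == len(edges):
            return kept, passes
        edges, passes = kept, passes + 1                   # removal may expose new dangling links


# Hypothetical link structure: d has no forward links, and c links only to d
edges = [
    ("http://a.example/", "http://b.example/"),
    ("http://b.example/", "http://a.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://c.example/", "http://d.example/"),
]
ids, id_edges = assign_ids(edges)
core_edges, passes = remove_dangling(id_edges)
```

Removing the link to d leaves c without forward links, so a second pass is needed. That is why removal is iterative.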
Convergence
• PR (322 million links): 52 iterations
• PR (161 million links): 45 iterations
• Scaling factor is roughly linear in log n
• The web is an expander-like graph
• A graph is said to be an expander if:
– Every subset of nodes S has a neighborhood that is larger than some
factor α times |S|
– α is called the expansion factor
• A graph has a good expansion factor if and only if the largest
eigenvalue is sufficiently larger than the second-largest
eigenvalue
Searching with PageRank
• Two search engines:
– Title-based search engine
– Full text search engine
• Title-based search engine
– Searches only the “Titles”
– Finds all the web pages whose titles contain all the query words
– Sorts the results by PageRank
– Very simple and cheap to implement
– Title match ensures high precision, and PageRank ensures high
quality
• Full text search engine
– Called Google
– Examines all the words in every stored document and also performs
PageRank (Rank Merging)
Title-based search for University
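The title-based engine described above can be sketched in a few lines. The titles, URLs and PageRank scores below are invented for illustration, and `title_search` is a hypothetical name; a real engine would use an index rather than a linear scan.

```python
def title_search(query, titles, pagerank):
    """Return URLs whose title contains every query word, highest PageRank first."""
    words = query.lower().split()
    matches = [
        url for url, title in titles.items()
        if all(word in title.lower().split() for word in words)
    ]
    return sorted(matches, key=lambda url: pagerank[url], reverse=True)


# Hypothetical data: titles and precomputed PageRank scores
titles = {
    "http://a.example/": "Example University",
    "http://b.example/": "University of Examples",
    "http://c.example/": "Example College",
}
scores = {"http://a.example/": 0.2, "http://b.example/": 0.5, "http://c.example/": 0.3}
results = title_search("university", titles, scores)
```

Matching decides which pages qualify, and PageRank alone decides their order.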
Personalized PageRank
• An important component of the PageRank calculation is E
– A vector over the web pages (used as a source of rank)
– A powerful parameter to adjust the page ranks
• The E vector corresponds to the distribution of web pages that a
random surfer periodically jumps to
• Having an E vector that is uniform over all the web pages results
in some web pages with many related links receiving an overly
high rank, e.g. a copyright page or forums
– This setting corresponds to general search over the internet
• In Personalized PageRank, E instead consists of a single web page
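The effect of concentrating E on a single page can be sketched as below. This is a minimal power iteration on R' = c (A R' + E) with renormalization to unit L1 norm; the ring graph, the weight 0.15 and the name `personalized_pagerank` are assumptions for illustration.

```python
def personalized_pagerank(links, E, iterations=200):
    """Power iteration on R' = c (A R' + E), renormalizing so that ||R'||_1 = 1."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: E.get(p, 0.0) for p in pages}       # rank source term E(u)
        for v in pages:
            for u in links[v]:
                new_rank[u] += rank[v] / len(links[v])      # sum over backlinks of R'(v) / N_v
        total = sum(new_rank.values())
        rank = {p: value / total for p, value in new_rank.items()}  # c = 1 / total
    return rank


# Hypothetical ring: without E, every page would be ranked equally
links = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
ranks_from_a = personalized_pagerank(links, {"A": 0.15})  # surfer always jumps back to A
ranks_from_c = personalized_pagerank(links, {"C": 0.15})  # surfer always jumps back to C
```

On this symmetric graph the ordering is determined entirely by E: the chosen page ranks first and rank falls off with link distance from it.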
Applications
• Estimating Web Traffic
On analyzing the statistics, it was found that there are some sites that
have very high usage but low PageRank,
e.g. links to pirated software
• PageRank as Backlink Predictor
The goal is to crawl the pages in as close to the optimal order as
possible, i.e., in the order of their rank.
PageRank is a better predictor than citation counting
• User Navigation: The PageRank Proxy
The user receives some information about the link before they click on it.
This proxy can help users decide which links are more likely to be
interesting
Conclusion
• PageRank is a global ranking of all web pages based on their
location in the Web's graph structure
• PageRank uses information which is external to the Web pages:
backlinks
• Backlinks from important pages are more significant than
backlinks from average pages
• The structure of the Web graph is very useful for information
retrieval tasks.
References
• L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank Citation Ranking:
Bringing Order to the Web, 1998
• S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search
Engine, 1998
• K. Bryan and T. Leise. The $25,000,000,000 Eigenvector: The Linear Algebra
Behind Google
• Google Corporate Information: http://www.google.com/corporate/tech.html
• http://en.wikipedia.org/wiki/PageRank
• http://en.wikipedia.org/wiki/Eigenvalue,_eigenvector_and_eigenspace
• http://www.googleguide.com/google_works.html
• http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
• http://pr.efactory.de/