Bryan Morgan CompSci 49S Shivnath Babu

advertisement
Bryan Morgan
CompSci 49S
Shivnath Babu
The Google Pagerank Algorthm and How it Works
Introduction
In this paper, the author, Ian Rogers, will:
1. Clearly explain how PageRank is calculated.
2. Go through every example in Chris’ paper, and add some more of my own,
showing the correct PageRank for each diagram. By showing the code used to
calculate each diagram I’ve opened myself up to peer review - mostly in an effort
to make sure the examples are correct, but also because the code can help explain
the PageRank calculations.
3. Describe some principles and observations on website design based on these
correctly calculated examples.
How is PageRank Used?
PageRank is one of the methods Google uses to determine a page’s relevance or
importance. The maximum PR of all pages on the web changes every month when
Google does its re-indexing. PageRank says nothing about the content or size of a page,
the language it’s written in, or the text used in the anchor of a link!
So what is PageRank?
In short PageRank is a “vote”, by all the other pages on the Web, about how important a
page is. A link to a page counts as a vote of support. If there’s no link there’s no support
(but it’s an abstention from voting rather than a vote against the page).
The equation for PageRank is:
PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn/C(TN))
Let’s break the equation down into sections
1. PR(Tn) – each page has a conception of its own self-importance. So “PR(T1)” is
for the first page in the web and “PR(Tn)” is the last page
2. C(Tn) – This is the number of outgoing links a page has.
3. PR(Tn)/C(Tn) – so if our page (Page A) has a backlink from page :n” the share of
the vote page A will get is “PR(Tn)/C(Tn)”
4. d – d is the dampering factor. It damps down the total vote by multiplying by 0.85
(the factor “d”)
5. (1-d) – if a page does not get pointed to, it will still get a small PR of .15 (i.e. 10.85).
How is PageRank Calculated?
This is where it gets tricky. The PR of each page depends on the PR of the pages pointing
to it. But we won’t know what PR those pages have until the pages pointing to them have
their PR calculated and so on… And when you consider that page links can form circles
it seems impossible to do this calculation!
But actually it’s not that bad. Remember this bit of the Google paper:
PageRank or PR(A) can be calculated using a simple iterative algorithm, and
corresponds to the principal eigenvector of the normalized link matrix of the web.
What that means to us is that we can just go ahead and calculate a page’s PR without
knowing the final value of the PR of the other pages. That seems strange but, basically,
each time we run the calculation we’re getting a closer estimate of the final value. So all
we need to do is remember the each value we calculate and repeat the calculations lots of
times until the numbers stop changing much.
Examples:
We don’t know what their PR should be to begin with, so let’s take a guess at 1.0 and do
some calculations:
d = 0.85
PR(A) = (1 – d) + d(PR(B)/1)
PR(B) = (1 – d) + d(PR(A)/1)
i.e.
PR(A) = 0.15 + 0.85 * 1
=1
PR(B) = 0.15 + 0.85 * 1
=1
Hmm, the numbers aren’t changing at all! So it looks like we started out with a lucky
guess!!!
If you do the same thing for this example here is what it would look like:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
a:
0.00000
0.15000
0.48612
0.81913
1.04169
1.19042
1.28982
1.35626
1.40065
1.43032
1.45015
1.46341
1.47226
1.47818
1.48214
1.48478
1.48655
1.48773
1.48852
1.48904
1.48940
1.48963
1.48979
1.48990
1.48997
1.49001
1.49004
1.49007
1.49008
1.49009
1.49009
1.49010
1.49010
1.49010
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
b:
0.00000
0.21375
0.35660
0.49813
0.59272
0.65593
0.69818
0.72641
0.74528
0.75789
0.76632
0.77195
0.77571
0.77823
0.77991
0.78103
0.78178
0.78228
0.78262
0.78284
0.78299
0.78309
0.78316
0.78321
0.78324
0.78326
0.78327
0.78328
0.78328
0.78329
0.78329
0.78329
0.78329
0.78329
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
c:
0.00000
0.39544
0.78721
1.04904
1.22403
1.34097
1.41912
1.47136
1.50626
1.52959
1.54518
1.55560
1.56257
1.56722
1.57033
1.57241
1.57380
1.57473
1.57535
1.57576
1.57604
1.57622
1.57635
1.57643
1.57649
1.57652
1.57655
1.57656
1.57657
1.57658
1.57659
1.57659
1.57659
1.57659
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
d:
0.00000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
a: 1.49010 b: 0.78329 c: 1.57659
a: 1.49011 b: 0.78329 c: 1.57660
a: 1.49011 b: 0.78330 c: 1.57660
a: 1.49011 b: 0.78330 c: 1.57660
a: 1.49011 b: 0.78330 c: 1.57660
a: 1.49011 b: 0.78330 c: 1.57660
Average pagerank = 1.0000
d:
d:
d:
d:
d:
d:
0.15000
0.15000
0.15000
0.15000
0.15000
0.15000
Download