Google Search Engine*
CS461 Lecture
Department of Computer Science
Iowa State University
1. “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, S. Brin and
L. Page, in Proceedings of WWW’98
2. “The PageRank Citation Ranking: Bringing Order to the Web”, L. Page, S. Brin,
R. Motwani, and T. Winograd, Technical Report, Stanford University, 1998
What to cover today
PageRank
Google Architecture
Problem Statement
Ultimate version

Find what I want
- In most cases, I don’t know exactly, or cannot
express clearly, what I want
- “What-I-want” can be estimated using a set of
keywords
Simplified version

Find the files that are most related to a set
of keywords
Naïve Solution
How it works


Download the entire Internet to a local machine
Search and return all files containing the set of
keywords
Problems: all files are treated as equally
important

Could return tons of files, but most of them are
not what I want
Since most users simply check out the first few
files, this scheme actually finds few useful
results
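The naive scheme above can be sketched in a few lines. This is only an illustration: the function name `naive_search` and the in-memory `files` dictionary are assumptions standing in for the downloaded files, and the result is deliberately unranked, which is exactly the problem listed on the slide.

```python
def naive_search(files, keywords):
    """Return the names of all files that contain every keyword (unranked)."""
    wanted = [k.lower() for k in keywords]
    return [name for name, text in files.items()
            if all(k in text.lower() for k in wanted)]

# Hypothetical stand-in for "the entire Internet on a local machine".
files = {
    "a.html": "PageRank is a link analysis algorithm",
    "b.html": "A recipe for apple pie",
    "c.html": "Link farms try to game PageRank",
}
matches = naive_search(files, ["pagerank", "link"])
print(matches)
```

Every matching file comes back with equal standing, so the order of `matches` says nothing about which file is more useful.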
Ranking Based on Hit Rate
How it works

A file is ranked higher if it is visited more
frequently
Problems


Could be affected by faked hits
A file that is already ranked high gets visited more, so it will be ranked higher and higher
Ranking based on Citation
Basic idea

A paper is important if it is cited by many papers
- Each paper has a set of references that link to the related work
- A pioneering paper typically has a high citation count

An HTML page is more important if it is linked to by many
other pages
- Each page may link to other pages
Problems

Publication of academic papers is well controlled
- Many are peer-reviewed
- Chronologically ordered

Internet files could be anything
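Citation-style ranking amounts to counting back links. A minimal sketch, assuming a made-up `links` dictionary that maps each page to the pages it links to (neither the name nor the data comes from the lecture):

```python
from collections import Counter

def backlink_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):      # ignore repeated links from one page
            if target != source:         # ignore self-links
                counts[target] += 1
    return counts

# Hypothetical link structure: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
counts = backlink_counts(links)
print(counts.most_common())
```

Every link counts the same here, no matter who created it, which is why uncontrolled Web content makes a plain count easy to inflate.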
Proposed: PageRank
Basic idea

A page with many links to it is more likely to be
useful than one with few links to it
- Just like citation

The links from a page that itself is the target of
many links are likely to be particularly important
- This is something new
[Figure: a page with its back links (incoming) and forward links (outgoing); each link has a different weight]
Proposed: PageRank
How it works


Each page is ranked using a value called PageRank (PR)
A page’s PR depends on the PRs of its back link pages
PR(A)=(1-d) + d*[PR(T1)/C(T1)+…+ PR(Tn)/C(Tn)]
d: damping factor, normally this is set to 0.85
T1, …, Tn: the pages that point to page A
PR(A): PageRank of page A
PR(Ti): PageRank of page Ti pointing to page A
C(Ti): the number of links going out of page Ti
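The formula can be read as a small function that recomputes one page's PR from the current PRs of its back link pages. This is a sketch under assumptions: the name `pagerank_of` and the `links` dictionary (page to list of pages it links to, with no repeated entries) are illustrative choices, not part of the original papers.

```python
def pagerank_of(page, ranks, links, d=0.85):
    """Apply PR(A) = (1 - d) + d * [PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)] once."""
    total = 0.0
    for source, targets in links.items():
        if page in targets:                        # source is one of T1..Tn
            total += ranks[source] / len(targets)  # PR(Ti) / C(Ti)
    return (1 - d) + d * total

# Hypothetical two-page web: A and B link to each other.
links = {"A": ["B"], "B": ["A"]}
ranks = {"A": 1.0, "B": 1.0}
pr_a = pagerank_of("A", ranks, links)
print(pr_a)
```

Because `ranks` is an input, a single call only gives one update; the values have to be recomputed repeatedly until they stop changing.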
Proposed: PageRank
Properties of PageRank formula

PageRanks form a probability distribution over web
pages: after normalization (dividing each PageRank by the
number of pages), the sum of all web pages' PageRanks will be one
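A short derivation of why the normalization works, as a sketch under two assumptions that the slide does not state: every page has at least one outgoing link, and the PageRanks have already settled to values that satisfy the formula. The symbols S (the sum of all PageRanks) and N (the number of pages) are introduced here for illustration.

```latex
\begin{aligned}
S &= \sum_{A} PR(A)
   = \sum_{A} \Bigl[(1-d) + d \sum_{T \to A} \frac{PR(T)}{C(T)}\Bigr] \\
  &= N(1-d) + d \sum_{T} C(T)\,\frac{PR(T)}{C(T)}
   = N(1-d) + d\,S \\
\Rightarrow\; S &= N, \qquad \frac{1}{N}\sum_{A} PR(A) = 1 .
\end{aligned}
```

Each page T appears in exactly C(T) of the inner sums, once for every page it links to, which is what collapses the double sum back to S.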
Challenge of calculating PageRanks

The links can be circular, e.g., A → B → A
[Figure: Page A and Page B link to each other]
PageRank Calculation
Assign each page an initial rank value

Could be any number (seed)
Repeat calculations until the rank of each page does
not change much
[Figure: Page A and Page B link to each other]
d = 0.85
PR(A) = (1 - d) + d(PR(B)/1)
PR(B) = (1 - d) + d(PR(A)/1)
Seed = 1
PR(A) = 0.15 + 0.85 * 1 = 1
PR(B) = 0.15 + 0.85 * 1 = 1
PageRank Calculation
Seed = 0, d = 0.85
PR(A) = (1 - d) + d(PR(B)/1)
PR(B) = (1 - d) + d(PR(A)/1)
1)
PR(A) = 0.15 + 0.85 * 0 = 0.15
PR(B) = 0.15 + 0.85 * 0.15 = 0.2775
2)
PR(A) = 0.15 + 0.85 * 0.2775 = 0.385875
PR(B) = 0.15 + 0.85 * 0.385875 = 0.47799375
3)
PR(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875
PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.622850484375
PageRank Calculation
Seed = 40, d = 0.85
PR(A) = (1 - d) + d(PR(B)/1)
PR(B) = (1 - d) + d(PR(A)/1)
1)
PR(A) = 0.15 + 0.85 * 40 = 34.15
PR(B) = 0.15 + 0.85 * 34.15 = 29.1775
2)
PR(A) = 0.15 + 0.85 * 29.1775 = 24.950875
PR(B) = 0.15 + 0.85 * 24.950875 = 21.35824375
3) ......
Observation: It doesn’t matter what seed value
you use: once the PageRank calculations settle down,
the “normalized probability distribution” (the average
PageRank for all pages) will be 1.0
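The observation can be checked numerically for the two-page example. The sketch below follows the update order used in the worked steps, where PR(B) is computed from the PR(A) just obtained; the function name `iterate_two_pages`, the tolerance, and the step limit are assumptions made for illustration.

```python
def iterate_two_pages(seed, d=0.85, tol=1e-9, max_steps=1000):
    """Iterate PR(A) = (1-d) + d*PR(B), PR(B) = (1-d) + d*PR(A) from a seed."""
    pr_a = pr_b = float(seed)
    step = 0
    for step in range(1, max_steps + 1):
        new_a = (1 - d) + d * pr_b      # uses the current PR(B)
        new_b = (1 - d) + d * new_a     # uses the PR(A) just computed
        change = abs(new_a - pr_a) + abs(new_b - pr_b)
        pr_a, pr_b = new_a, new_b
        if change < tol:                # ranks no longer change much
            break
    return pr_a, pr_b, step

for seed in (0, 1, 40):
    a, b, steps = iterate_two_pages(seed)
    print(f"seed={seed}: PR(A)={a:.6f}, PR(B)={b:.6f} after {steps} steps")
```

A larger seed only changes how many steps are needed, not the values the calculation settles on.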
Example of Calculation (0)
[Figure: four pages, A, B, C and D, with links A → B, A → C, B → C, C → A and D → C]
Example of Calculation (1)
Each page starts with a value of 1:
Page A: 1
Page B: 1
Page C: 1
Page D: 1
Example of Calculation (2)
Each page passes on 0.85 of its value, split evenly over its outgoing links:
Page A (1): 1*0.85/2 to Page B and 1*0.85/2 to Page C
Page B (1): 1*0.85 to Page C
Page C (1): 1*0.85 to Page A
Page D (1): 1*0.85 to Page C
Each page also has 0.15 that is not transferred, so we get:
Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425
(from Page A) + 0.15 (not transferred) = 2.275
Page D: receives none, but has not transferred 0.15 = 0.15
New values:
Page A: 1
Page B: 0.575
Page C: 2.275
Page D: 0.15
Example of Calculation (3)
Starting from the previous values (Page A: 1, Page B: 0.575, Page C: 2.275, Page D: 0.15):
Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) +
1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
Page D: receives none, but has not transferred, remains at 0.15
New values:
Page A: 2.08375
Page B: 0.575
Page C: 1.19125
Page D: 0.15
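The two hand calculations above can be replayed with one update function. In this sketch every page is recomputed from the previous round's values, as in the worked steps. The link structure in `LINKS` is inferred from the "from Page X" annotations and is therefore an assumption, as are the names `LINKS` and `step`.

```python
# Assumed link structure: page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def step(ranks, links=LINKS, d=0.85):
    """One synchronous update: every page is recomputed from the old ranks."""
    new = {page: 1 - d for page in links}          # the 0.15 "not transferred"
    for source, targets in links.items():
        share = d * ranks[source] / len(targets)   # split evenly over out-links
        for target in targets:
            new[target] += share
    return new

r0 = {page: 1.0 for page in LINKS}   # Example of Calculation (1)
r1 = step(r0)                        # Example of Calculation (2)
r2 = step(r1)                        # Example of Calculation (3)
print(r1)
print(r2)
```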
Example of Calculation (4)
After 20 iterations, we get
Page A: 1.490
Page B: 0.783
Page C: 1.577
Page D: 0.15
In reality: PageRank for 26,000,000 web pages can be computed in
a few hours on a medium-size workstation (1998).
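A minimal sketch of the full calculation for the four-page example, repeating the update until no rank changes by more than a small tolerance. It does not stop at exactly 20 iterations; the stopping rule, the function name `pagerank`, and the link structure (inferred from the worked steps) are assumptions, and every linked page is assumed to appear as a key in `links`.

```python
def pagerank(links, d=0.85, seed=1.0, tol=1e-10, max_iter=1000):
    """Repeat synchronous PageRank updates until no rank changes by more than tol."""
    ranks = {page: seed for page in links}
    for _ in range(max_iter):
        new = {page: 1 - d for page in links}
        for source, targets in links.items():
            for target in targets:
                new[target] += d * ranks[source] / len(targets)
        converged = max(abs(new[p] - ranks[p]) for p in links) < tol
        ranks = new
        if converged:
            break
    return ranks

# Assumed link structure of the four-page example: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"Page {page}: {value:.3f}")
```

Dividing each value by the number of pages turns the result into a distribution that sums to one.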
Result
Page C has the highest PageRank, and
page A has the next highest: page C has
the highest importance in this link structure!
More iterations lead to stable PageRank
values, which are then used to rank the pages
returned for a keyword search.
PageRank Summary
PageRank is a citation importance ranking

Approximated measure of importance or quality
Number of citations or backlinks
Each citation has a different weight
The pages with high PageRanks are those that
are linked to by many pages and/or by
important pages (e.g., Yahoo!)
Questions: how to improve the ranking of your
web pages?

Creating dummy sites to link to your main site?
Increasing internal links and/or decreasing
external links?
Google Architecture